WIP: Fast GiST index build

Started by Alexander Korotkov over 14 years ago, 143 messages
#1 Alexander Korotkov
aekorotkov@gmail.com
1 attachment(s)

Hackers,

A WIP patch for fast GiST index build is attached. The code is dirty and
comments are lacking, but it works. It is now ready for first benchmarks,
which should prove the efficiency of the selected technique. It's time to
compare fast GiST index build with repeated-insert build on large enough
datasets (datasets which don't fit in cache). The aims of testing are:
1) Measure acceleration of index build.
2) Measure change in index quality.
I'm going to do the first tests using synthetic datasets. Anybody who has
interesting real-life datasets for testing is welcome.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.0.3.patch.gz (application/x-gzip)

#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: WIP: Fast GiST index build

On 03.06.2011 14:02, Alexander Korotkov wrote:

> Hackers,
>
> A WIP patch for fast GiST index build is attached. The code is dirty and
> comments are lacking, but it works. It is now ready for first benchmarks,
> which should prove the efficiency of the selected technique. It's time to
> compare fast GiST index build with repeated-insert build on large enough
> datasets (datasets which don't fit in cache). The aims of testing are:
> 1) Measure acceleration of index build.
> 2) Measure change in index quality.
> I'm going to do the first tests using synthetic datasets. Anybody who has
> interesting real-life datasets for testing is welcome.

I did some quick performance testing of this. I installed postgis 1.5,
and loaded an extract of the OpenStreetMap data covering Finland. The
biggest gist index in that data set is the idx_nodes_geom index on nodes
table. I have maintenance_work_mem and shared_buffers both set to 512
MB, and this laptop has 4GB of RAM.

Without the patch, reindexing the index takes about 170 seconds and the
index size is 321 MB. And with the patch, it takes about 150 seconds,
and the resulting index size is 319 MB.

The nodes table is 618MB in size, so it fits in RAM. I presume the gain
would be bigger if it doesn't, as the random I/O to update the index
starts to hurt more. But this shows that even when it does, this patch
helps a little bit, and the resulting index size is comparable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: WIP: Fast GiST index build

On 03.06.2011 14:02, Alexander Korotkov wrote:

> Hackers,
>
> A WIP patch for fast GiST index build is attached. The code is dirty and
> comments are lacking, but it works. It is now ready for first benchmarks,
> which should prove the efficiency of the selected technique. It's time to
> compare fast GiST index build with repeated-insert build on large enough
> datasets (datasets which don't fit in cache). The aims of testing are:
> 1) Measure acceleration of index build.
> 2) Measure change in index quality.
> I'm going to do the first tests using synthetic datasets. Anybody who has
> interesting real-life datasets for testing is welcome.

I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite
dramatic.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#3)
Re: WIP: Fast GiST index build

On 06.06.2011 10:42, Heikki Linnakangas wrote:

> On 03.06.2011 14:02, Alexander Korotkov wrote:
>
>> Hackers,
>>
>> A WIP patch for fast GiST index build is attached. The code is dirty and
>> comments are lacking, but it works. It is now ready for first benchmarks,
>> which should prove the efficiency of the selected technique. It's time to
>> compare fast GiST index build with repeated-insert build on large enough
>> datasets (datasets which don't fit in cache). The aims of testing are:
>> 1) Measure acceleration of index build.
>> 2) Measure change in index quality.
>> I'm going to do the first tests using synthetic datasets. Anybody who has
>> interesting real-life datasets for testing is welcome.
>
> I ran another test with a simple table generated with:
>
> CREATE TABLE pointtest (p point);
> INSERT INTO pointtest SELECT point(random(), random()) FROM
> generate_series(1,50000000);
>
> Generating a gist index with:
>
> CREATE INDEX i_pointtest ON pointtest USING gist (p);
>
> took about 15 hours without the patch, and 2 hours with it. That's quite
> dramatic.

Oops, that was a rounding error, sorry. The run took about 2.7 hours
with the patch, which of course should be rounded to 3 hours, not 2.
Anyway, it is still a very impressive improvement.
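
For reference, the speedups implied by the timings reported so far in this thread can be worked out directly. The Python lines below are only a sketch: the helper name `speedup` is illustrative and has nothing to do with the patch, and the inputs are the approximate figures quoted above.

```python
def speedup(before, after):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after

# OpenStreetMap (Finland) reindex, in seconds: about 170 without the patch, 150 with it.
osm = speedup(170, 150)

# 50 million random points, in hours: about 15 without the patch, 2.7 with it.
points = speedup(15, 2.7)

print(f"OSM reindex:   {osm:.2f}x faster")
print(f"Random points: {points:.2f}x faster")
```

In other words, the data set that fits in RAM builds roughly 1.13 times faster, while the data set that does not fit builds about 5.6 times faster.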

I'm glad you could get the patch ready for benchmarking this quickly.
Now you just need to get the patch into shape so that it can be
committed. That is always the more time-consuming part, so I'm glad you
have plenty of time left for it.

Could you please create a TODO list on the wiki page, listing all the
missing features, known bugs etc. that will need to be fixed? That'll
make it easier to see how much work there is left. It'll also help
anyone looking at the patch to know which issues are known issues.

Meanwhile, it would still be very valuable if others could test this
with different workloads. And Alexander, it would be good if at some
point you could write some benchmark scripts too, and put them on the
wiki page, just to see what kind of workloads have been taken into
consideration and tested already. Do you think there's some worst-case
data distributions where this algorithm would perform particularly badly?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: WIP: Fast GiST index build

Hi!

On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> On 06.06.2011 10:42, Heikki Linnakangas wrote:
>
>> I ran another test with a simple table generated with:
>>
>> CREATE TABLE pointtest (p point);
>> INSERT INTO pointtest SELECT point(random(), random()) FROM
>> generate_series(1,50000000);
>>
>> Generating a gist index with:
>>
>> CREATE INDEX i_pointtest ON pointtest USING gist (p);
>>
>> took about 15 hours without the patch, and 2 hours with it. That's quite
>> dramatic.
>
> Oops, that was a rounding error, sorry. The run took about 2.7 hours with
> the patch, which of course should be rounded to 3 hours, not 2. Anyway, it
> is still a very impressive improvement.

I have similar results on 100 million rows: 21.6 hours without the patch and
2 hours with the patch. But I found a problem: index quality is worse. See
the following query plans. Here test is the relation where the index was
created in the ordinary way, and test2 is the relation where the patch was used.

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20) (actual time=1.257..2.147 rows=838 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=968
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000 width=0) (actual time=1.162..1.162 rows=838 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=131
 Total runtime: 2.214 ms
(7 rows)

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20) (actual time=5.252..6.056 rows=838 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=1458
   ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000 width=0) (actual time=5.155..5.155 rows=838 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=621
 Total runtime: 6.121 ms
(7 rows)

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20) (actual time=2.148..2.977 rows=850 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1099
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000 width=0) (actual time=2.052..2.052 rows=850 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=249
 Total runtime: 3.033 ms
(7 rows)

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20) (actual time=6.806..7.602 rows=850 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1615
   ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000 width=0) (actual time=6.709..6.709 rows=850 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=773
 Total runtime: 7.667 ms
(7 rows)

We can see that the index scan on test2 has to read several times more
pages. The original paper notes this effect and explains it by routing
rectangles being formed in less optimal ways. But the effect wasn't so
dramatic in the tests presented in the paper. So, I have the following
thoughts about this problem:

1) The number of pages read from the index is too large even with the
ordinary index build. Querying a small area requires reading hundreds of
pages. This is probably caused by the picksplit implementation. I have a
version of the picksplit algorithm which seems to be much more efficient.
I'll do some benchmarks with my picksplit algorithm. I hope the difference
in index quality will not be so dramatic.

2) I can try some enhancements to the fast build algorithm which could
improve tree quality. In the original paper a Hilbert heuristic was used to
achieve even better tree quality than that of a tree created in the ordinary
way. But since we use GiST we are restricted by its interface (or we have to
create new interface function(s), which I'd like to avoid). I would like to
try ordering by penalty value in the buffer emptying process and in buffer
relocation on split.

3) There may also be some bug which affects tree quality.
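
The page-access gap in the plans above can be quantified by comparing the "Buffers: shared hit" counts of the Bitmap Index Scan nodes. The Python sketch below does that for the first pair of plans. The function name `index_scan_hits` and the abbreviated plan strings are illustrative assumptions; only the numbers are taken from the plans above.

```python
import re

def index_scan_hits(plan_text):
    """Return the 'shared hit' count reported under the Bitmap Index Scan node."""
    in_index_scan = False
    for line in plan_text.splitlines():
        if "Bitmap Index Scan" in line:
            in_index_scan = True
        elif in_index_scan:
            match = re.search(r"Buffers: shared hit=(\d+)", line)
            if match:
                return int(match.group(1))
    raise ValueError("no Bitmap Index Scan buffers line found")

# Abbreviated copies of the first two plans
# (test: ordinary build, test2: build with the patch).
plan_test = """Bitmap Heap Scan on test
  Buffers: shared hit=968
  -> Bitmap Index Scan on test_idx
        Buffers: shared hit=131"""
plan_test2 = """Bitmap Heap Scan on test2
  Buffers: shared hit=1458
  -> Bitmap Index Scan on test2_idx
        Buffers: shared hit=621"""

hits_ordinary = index_scan_hits(plan_test)
hits_fast = index_scan_hits(plan_test2)
ratio = hits_fast / hits_ordinary
print(f"index pages touched: {hits_ordinary} vs {hits_fast} ({ratio:.1f}x)")
```

The same arithmetic on the second pair of plans (249 vs 773 index buffers) gives about 3.1x.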

> Could you please create a TODO list on the wiki page, listing all the
> missing features, known bugs etc. that will need to be fixed? That'll make
> it easier to see how much work there is left. It'll also help anyone looking
> at the patch to know which issues are known issues.

Sure. I'll create such a list on the wiki page. I believe that currently the
most important issue is index quality.

> Meanwhile, it would still be very valuable if others could test this with
> different workloads. And Alexander, it would be good if at some point you
> could write some benchmark scripts too, and put them on the wiki page, just
> to see what kind of workloads have been taken into consideration and tested
> already. Do you think there's some worst-case data distributions where this
> algorithm would perform particularly badly?

I don't expect any bad cases in terms of I/O. My main worry is index
quality, which is strongly related to data distribution.

------
With best regards,
Alexander Korotkov.

#6 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: WIP: Fast GiST index build

On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Do you think there's some worst-case data distributions where this
> algorithm would perform particularly badly?

I think there could be some worst-case GiST applications. Right now the GiST
fast build algorithm invokes more penalty calls than the repeated-insert
algorithm. If I succeed then it will invoke even more such calls. So, if the
penalty function is very slow, then GiST fast build will be slower than
repeated insert.
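
The trade-off described above can be put as a toy cost model in which build time is the time spent outside the penalty function plus the number of penalty calls times the cost of one call. This is only a sketch under that assumption. The function name and every input number below are hypothetical, picked to illustrate the formula, and are not measurements from the patch.

```python
def break_even_penalty_cost(io_insert, io_fast, calls_insert, calls_fast):
    """Per-call penalty cost above which the fast build becomes the slower one.

    Toy model (an assumption, not measured): total = io_time + calls * cost.
    """
    if calls_fast <= calls_insert:
        raise ValueError("model assumes the fast build makes more penalty calls")
    return (io_insert - io_fast) / (calls_fast - calls_insert)

# Hypothetical inputs, chosen only to illustrate the formula.
io_insert, io_fast = 1000.0, 100.0        # seconds spent outside penalty calls
calls_insert, calls_fast = 1.0e9, 3.0e9   # number of penalty calls

threshold = break_even_penalty_cost(io_insert, io_fast, calls_insert, calls_fast)
print(f"fast build loses once a penalty call costs more than {threshold:.2e} s")
```

Under this model, the larger the I/O saving and the smaller the number of extra penalty calls, the slower the penalty function has to be before the fast build loses.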

------
With best regards,
Alexander Korotkov.

#7 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#6)
Re: WIP: Fast GiST index build

On Mon, Jun 6, 2011 at 4:14 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

> If I succeed then it will invoke even more such calls.

I meant that if I succeed with the enhancements that improve index quality,
then the fast build algorithm will invoke even more such calls.

------
With best regards,
Alexander Korotkov.

#8 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#7)
1 attachment(s)
Re: WIP: Fast GiST index build

I've tried sorting index tuples by penalty function before buffer relocation
on split, but without any success: index quality becomes even worse than
without sorting.
The next thing I've tried is buffer relocation between all neighbor buffers.
The results of the first tests are much more promising: the number of page
accesses during an index scan is similar to that without fast index build.
I'm going to stick with this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

test=# select pg_size_pretty(pg_relation_size('test_idx'));
pg_size_pretty
----------------
6939 MB
(1 row)

test=# explain (analyze, buffers) select * from test where v <@ '(0.903,0.203),(0.9,0.2)'::box;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16) (actual time=1.412..2.295 rows=897 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=1038
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000 width=0) (actual time=1.311..1.311 rows=897 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=141
 Total runtime: 2.375 ms
(7 rows)

test=# explain (analyze, buffers) select * from test where v <@ '(0.503,0.503),(0.5,0.5)'::box;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16) (actual time=2.113..2.972 rows=855 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1095
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000 width=0) (actual time=2.016..2.016 rows=855 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=240
 Total runtime: 3.043 ms
(7 rows)

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.1.0.patch.gz (application/x-gzip)
#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#8)
Re: WIP: Fast GiST index build

On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I've tried sorting index tuples by penalty function before buffer
relocation on split, but without any success: index quality becomes
even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor
buffers. Results of the first tests are much more promising. The number of page
accesses during index scan is similar to that without fast index build. I'm
going to stick with this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

I forgot to say that build time increases by about 40%, but it is still
about 10 times faster than the ordinary build.

------
With best regards,
Alexander Korotkov.

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#9)
Re: WIP: Fast GiST index build

On 15.06.2011 10:24, Alexander Korotkov wrote:

On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I've tried sorting index tuples by penalty function before buffer
relocation on split, but without any success: index quality becomes
even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor
buffers. Results of the first tests are much more promising. The number of page
accesses during index scan is similar to that without fast index build. I'm
going to stick with this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

I forgot to say that build time increases by about 40%, but it is still
about 10 times faster than the ordinary build.

Is this relocation mechanism something that can be tuned, for different
tradeoffs between index quality and build time? In any case, it seems
that we're going to need a lot of testing with different data sets to
get a better picture of how this performs. But at least for now, it
looks like this approach is going to be acceptable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: WIP: Fast GiST index build

On Wed, Jun 15, 2011 at 12:03 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Is this relocation mechanism something that can be tuned, for different
tradeoffs between index quality and build time?

Yes, it can. I believe it can be an index parameter.

In any case, it seems that we're going to need a lot of testing with
different data sets to get a better picture of how this performs.

Sure. My problem is that I don't have large enough real-life datasets. Results
on synthetic datasets can be unrepresentative of real-life cases. On the smaller
datasets that I have, I can actually compare only index quality. Also, tests
with large datasets take a long time, especially without fast build. A possible
solution is to limit cache size during testing. It should allow measuring the
I/O benefit even on relatively small datasets. But so far I don't know how to
do that on Linux.

------
With best regards,
Alexander Korotkov.

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#11)
Re: WIP: Fast GiST index build

Actually, I would like to measure CPU and I/O load independently for more
comprehensive benchmarks. Can you advise some appropriate tools for it?

------
With best regards,
Alexander Korotkov.

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#12)
Re: WIP: Fast GiST index build

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?
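
A minimal sketch of the /proc/PID/stat idea, assuming Linux and the field layout described in proc(5), where utime and stime are fields 14 and 15 and are counted in clock ticks. The function name parse_proc_stat_cpu is hypothetical; it is not part of PostgreSQL or of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract utime and stime (in clock ticks) from the text of a
 * /proc/PID/stat file. Returns 0 on success, -1 if the text does
 * not look like a stat line.
 */
static int
parse_proc_stat_cpu(const char *line, unsigned long *utime,
                    unsigned long *stime)
{
    const char *p;

    assert(line != NULL && utime != NULL && stime != NULL);

    /*
     * Field 2 (the command name) is wrapped in parentheses and may itself
     * contain spaces or parentheses, so start after the LAST ')'.
     */
    p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    p++;

    /*
     * Remaining fields: state, ppid, pgrp, session, tty_nr, tpgid, flags,
     * minflt, cminflt, majflt, cmajflt, then utime and stime.
     */
    if (sscanf(p, " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
               utime, stime) != 2)
        return -1;

    return 0;
}
```

Reading the backend's stat file before and after CREATE INDEX and dividing the tick difference by sysconf(_SC_CLK_TCK) would give the CPU seconds spent on the build.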

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 1:33 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

Actually, I would like to measure CPU and I/O load independently for more
comprehensive benchmarks. Can you advise some appropriate tools for it?

------
With best regards,
Alexander Korotkov.

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#13)
Re: WIP: Fast GiST index build

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);
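
The user and system CPU figures in a log like this are per-process resource counters kept by the operating system. As an illustration only (a sketch assuming a POSIX system; the type and helper names are hypothetical and this is not PostgreSQL code), the same kind of counters can be sampled around any piece of work with getrusage():

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* User and system CPU time consumed by the current process, in seconds. */
typedef struct CpuUsage
{
    double      user_sec;
    double      sys_sec;
} CpuUsage;

static double
timeval_to_sec(struct timeval tv)
{
    return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

/* Take a snapshot of the CPU time used so far by this process. */
static CpuUsage
cpu_usage_now(void)
{
    struct rusage ru;
    CpuUsage    u;
    int         rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0);
    (void) rc;                  /* silence unused warning under NDEBUG */

    u.user_sec = timeval_to_sec(ru.ru_utime);
    u.sys_sec = timeval_to_sec(ru.ru_stime);
    return u;
}

/* CPU time spent between two snapshots. */
static CpuUsage
cpu_usage_diff(CpuUsage start, CpuUsage end)
{
    CpuUsage    d;

    d.user_sec = end.user_sec - start.user_sec;
    d.sys_sec = end.sys_sec - start.sys_sec;
    return d;
}
```

Taking one snapshot before the work and one after, then subtracting, separates CPU time from wall-clock time; a build that is I/O bound shows a large elapsed time but a small CPU difference.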

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: WIP: Fast GiST index build

Oh, actually it's so easy. Thanks.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#15)
1 attachment(s)
Re: WIP: Fast GiST index build

New version of the patch. There are some bugfixes, minor refactoring, and comments
(unfortunately, not all the code is covered by comments yet). Also, a
"fastbuild" parameter was added to the GiST index. It allows testing index
builds with and without fast build without recompiling postgres.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:35 PM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

Oh, actually it's so easy. Thanks.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist_fast_build-0.2.0.patch.gz (application/x-gzip)
#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#16)
Re: WIP: Fast GiST index build

Hi!

I've created a section about testing on the project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about the table structure?
As you can see, I found that CPU usage can be much higher
with gist_trgm_ops. I believe it's due to the relatively expensive penalty
method in that opclass. But the index build can probably still be faster when
the index doesn't fit in cache, even for gist_trgm_ops. Also, with that opclass
index quality is slightly worse, but the difference is not dramatic.

------
With best regards,
Alexander Korotkov.

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#17)
Re: WIP: Fast GiST index build

On 21.06.2011 13:08, Alexander Korotkov wrote:

I've created a section about testing on the project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about the table structure?

It would be nice to have links to the datasets and scripts used, so that
others can reproduce the tests.

It's surprising that the search time differs so much between the
point_ops tests with uniformly random data with 100M and 10M rows. Just
to be sure I'm reading it correctly: a small search time is good, right?
You might want to spell that out explicitly.

As you can see I found that CPU usage might be much higher
with gist_trgm_ops.

Yeah, that is a bit worrisome. 6 minutes without the patch and 18
minutes with it.

I believe it's due to relatively expensive penalty
method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a
gist_trgm_ops index on a list of English words from
/usr/share/dict/words. oprofile shows that with the patch, 60% of the
CPU time is spent in the makesign() function.

But, probably index build can be still faster when
index doesn't fit cache even for gist_trgm_ops.

Yep.

Also with that opclass index
quality is slightly worse but the difference is not dramatic.

A 5-10% difference should be acceptable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: WIP: Fast GiST index build

On Fri, Jun 24, 2011 at 12:40 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 21.06.2011 13:08, Alexander Korotkov wrote:

I've created a section about testing on the project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about the table structure?

It would be nice to have links to the datasets and scripts used, so that
others can reproduce the tests.

Done.

It's surprising that the search time differs so much between the point_ops
tests with uniformly random data with 100M and 10M rows. Just to be sure I'm
reading it correctly: a small search time is good, right? You might want to
spell that out explicitly.

Yes, you're reading this correctly. A detailed explanation was added to the
wiki page. It's surprising to me too. I need some more insight into the causes
of the index quality difference.

I have now found some large enough real-life datasets (thanks to Oleg Bartunov)
and I'm running tests on them.

------
With best regards,
Alexander Korotkov.

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#18)
2 attachment(s)
Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On 24.06.2011 11:40, Heikki Linnakangas wrote:

On 21.06.2011 13:08, Alexander Korotkov wrote:

I believe it's due to relatively expensive penalty
method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a
gist_trgm_ops index on a list of English words from
/usr/share/dict/words. oprofile shows that with the patch, 60% of the
CPU time is spent in the makesign() function.

I couldn't resist looking into this, and came up with the attached
patch. I tested this with:

CREATE TABLE words (word text);
COPY words FROM '/usr/share/dict/words';
CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

And then ran "REINDEX INDEX i_words" a few times with and without the
patch. Without the patch, reindex takes about 4.7 seconds. With the
patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes
even more important with Alexander's fast GiST build patch, which calls
the penalty function more.

I used the attached showsign-debug.patch to verify that the patched
makesign function produces the same results as the existing code. I
haven't tested the big-endian code, however.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fast_makesign.patch (text/x-diff)
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index b328a09..1f3d3e3 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -84,17 +84,88 @@ gtrgm_out(PG_FUNCTION_ARGS)
 static void
 makesign(BITVECP sign, TRGM *a)
 {
-	int4		k,
-				len = ARRNELEM(a);
+	int4		len = ARRNELEM(a);
 	trgm	   *ptr = GETARR(a);
-	int4		tmp = 0;
+	char	   *p;
+	char	   *endptr;
+	uint32		w1,
+				w2,
+				w3;
+	uint32		trg1,
+				trg2,
+				trg3,
+				trg4;
+	uint32	   *p32;
 
 	MemSet((void *) sign, 0, sizeof(BITVEC));
 	SETBIT(sign, SIGLENBIT);	/* set last unused bit */
-	for (k = 0; k < len; k++)
+
+	if (len == 0)
+		return;
+
+	endptr = (char *) (ptr + len);
+	/*------------------------------------------------------------------------
+	 * We have to extract each trigram into a uint32, and calculate the HASH.
+	 * This would be a lot easier if each trigram was aligned at 4-byte
+	 * boundary, but they're not. The simple way would be to copy each
+	 * trigram byte-per-byte, but that is quite slow, and this function is a
+	 * hotspot in penalty calculation.
+	 *
+	 * The first trigram in the array doesn't begin at a 4-byte boundary, the
+	 * flags byte comes first, but the next one does. So we fetch the first
+	 * trigram as a special case, and after that the following four trigrams
+	 * fall onto 4-byte words like this:
+	 *
+	 *  w1   w2   w3
+	 * AAAB BBCC CDDD
+	 *
+	 * As long as there's at least four trigrams left to process, we fetch
+	 * the next three words and extract the trigrams from them with bit
+	 * operations.
+	 *------------------------------------------------------------------------
+	 */
+	p32 = (uint32 *) (((char *) ptr) - 1);
+
+	/* Fetch and extract the initial word */
+	w1 = *(p32++);
+#ifdef WORDS_BIGENDIAN
+	trg1 = w1 << 8;
+#else
+	trg1 = w1 >> 8;
+#endif
+	HASH(sign, trg1);
+
+	while((char *) p32 < endptr - 12)
 	{
-		CPTRGM(((char *) &tmp), ptr + k);
-		HASH(sign, tmp);
+		w1 = *(p32++);
+		w2 = *(p32++);
+		w3 = *(p32++);
+
+#ifdef WORDS_BIGENDIAN
+		trg1 = w1 & 0xFFFFFF00;
+		trg2 = (w1 << 24) | ((w2 & 0xFFFF0000) >> 8);
+		trg3 = ((w2 & 0x0000FFFF) << 16) | ((w3 & 0xFF000000) >> 16);
+		trg4 = w3 << 8;
+#else
+		trg1 = w1 & 0x00FFFFFF;
+		trg2 = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
+		trg3 = ((w2 & 0xFFFF0000) >> 16) | ((w3 & 0x000000FF) << 16);
+		trg4 = w3 >> 8;
+#endif
+
+		HASH(sign, trg1);
+		HASH(sign, trg2);
+		HASH(sign, trg3);
+		HASH(sign, trg4);
+	}
+
+	/* Handle remaining 1-3 trigrams the slow way */
+	p = (char *) p32;
+	while (p < endptr)
+	{
+		CPTRGM(((char *) &trg1), p);
+		HASH(sign, trg1);
+		p += 3;
 	}
 }
 
showsign-debug.patch (text/x-diff)
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index b328a09..b5be800 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -44,6 +44,9 @@ Datum		gtrgm_penalty(PG_FUNCTION_ARGS);
 PG_FUNCTION_INFO_V1(gtrgm_picksplit);
 Datum		gtrgm_picksplit(PG_FUNCTION_ARGS);
 
+PG_FUNCTION_INFO_V1(gtrgm_showsign);
+Datum		gtrgm_showsign(PG_FUNCTION_ARGS);
+
 #define GETENTRY(vec,pos) ((TRGM *) DatumGetPointer((vec)->vector[(pos)].key))
 
 /* Number of one-bits in an unsigned byte */
@@ -98,6 +101,32 @@ makesign(BITVECP sign, TRGM *a)
 	}
 }
 
+static char *
+printsign(BITVECP sign)
+{
+	static char c[200];
+	char *p = c;
+	int i;
+	for(i=0; i < SIGLEN;i++)
+	{
+		p += snprintf(p, 3, "%02x", (unsigned int) (((unsigned char *) sign)[i]));
+	}
+	return c;
+}
+
+Datum
+gtrgm_showsign(PG_FUNCTION_ARGS)
+{
+	text	   *in = PG_GETARG_TEXT_P(0);
+	BITVEC		sign;
+	TRGM	   *trg;
+
+	trg = generate_trgm(VARDATA(in), VARSIZE(in) - VARHDRSZ);
+	makesign(sign, trg);
+
+	PG_RETURN_TEXT_P(cstring_to_text(printsign(sign)));
+}
+
 Datum
 gtrgm_compress(PG_FUNCTION_ARGS)
 {
#21Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#20)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On Fri, Jun 24, 2011 at 12:51 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.06.2011 11:40, Heikki Linnakangas wrote:

On 21.06.2011 13:08, Alexander Korotkov wrote:

I believe it's due to relatively expensive penalty
method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a
gist_trgm_ops index on a list of English words from
/usr/share/dict/words. oprofile shows that with the patch, 60% of the
CPU time is spent in the makesign() function.

I couldn't resist looking into this, and came up with the attached patch. I
tested this with:

CREATE TABLE words (word text);
COPY words FROM '/usr/share/dict/words';
CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

And then ran "REINDEX INDEX i_words" a few times with and without the patch.
Without the patch, reindex takes about 4.7 seconds. With the patch, 3.7
seconds. That's a worthwhile gain on its own, but becomes even more
important with Alexander's fast GiST build patch, which calls the penalty
function more.

I used the attached showsign-debug.patch to verify that the patched makesign
function produces the same results as the existing code. I haven't tested
the big-endian code, however.

Out of curiosity (and because there is no comment or Assert here), how
can you be so sure of the input alignment?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#21)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On 24.06.2011 21:24, Robert Haas wrote:

Out of curiosity (and because there is no comment or Assert here), how
can you be so sure of the input alignment?

The input TRGM to makesign() is a varlena, so it must be at least 4-byte
aligned. If it was not for some reason, the existing VARSIZE invocation
(within GETARR()) would already fail on platforms that are strict about
alignment.
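
The word-splitting trick that this alignment argument protects can be checked in isolation. The following is a minimal sketch, not the patch itself: it assumes the little-endian layout from the #else branch of the patch, builds the 32-bit words from bytes so that it runs on any platform, and compares the result with byte-by-byte packing. All function names here (is_aligned4, load_le32, pack3, split_words_le, split_matches_bytewise) are hypothetical; is_aligned4 only illustrates the kind of check an Assert could perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p lies on a 4-byte boundary. */
static int
is_aligned4(const void *p)
{
	return ((uintptr_t) p & 3) == 0;
}

/* Build a 32-bit word from four bytes, least significant byte first. */
static uint32_t
load_le32(const unsigned char *p)
{
	return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
		((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Reference path: pack one 3-byte trigram byte by byte. */
static uint32_t
pack3(const unsigned char *p)
{
	return (uint32_t) p[0] | ((uint32_t) p[1] << 8) | ((uint32_t) p[2] << 16);
}

/* Fast path: split three words (AAAB BBCC CDDD) into four trigrams. */
static void
split_words_le(uint32_t w1, uint32_t w2, uint32_t w3, uint32_t trg[4])
{
	trg[0] = w1 & 0x00FFFFFF;
	trg[1] = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
	trg[2] = ((w2 & 0xFFFF0000) >> 16) | ((w3 & 0x000000FF) << 16);
	trg[3] = w3 >> 8;
}

/* Returns 1 if both paths agree on the 12 bytes starting at p. */
static int
split_matches_bytewise(const unsigned char *p)
{
	uint32_t	trg[4];
	int			i;

	assert(p != NULL);
	split_words_le(load_le32(p), load_le32(p + 4), load_le32(p + 8), trg);
	for (i = 0; i < 4; i++)
	{
		if (trg[i] != pack3(p + 3 * i))
			return 0;
	}
	return 1;
}
```

Calling split_matches_bytewise() on any 12-byte buffer should return 1. The real patch reads the words directly through a uint32 pointer, which is why the 4-byte alignment of the varlena matters there.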

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#23Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On Fri, Jun 24, 2011 at 3:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.06.2011 21:24, Robert Haas wrote:

Out of curiosity (and because there is no comment or Assert here), how
can you be so sure of the input alignment?

The input TRGM to makesign() is a varlena, so it must be at least 4-byte
aligned. If it was not for some reason, the existing VARSIZE invocation
(within GETARR()) would already fail on platforms that are strict about
alignment.

Hmm, OK. Might be worth adding a comment, anyway...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Jesper Krogh
jesper@krogh.cc
In reply to: Heikki Linnakangas (#3)
Re: WIP: Fast GiST index build

On 2011-06-06 09:42, Heikki Linnakangas wrote:

took about 15 hours without the patch, and 2 hours with it. That's
quite dramatic.

With the presence of robust consumer-class SSD drives, available in
sizes that actually fit "many" database usage scenarios, a PostgreSQL
version is not likely to hit the streets before 50% of PostgreSQL users
are sitting on "some kind" of flash-based storage (for the part where
the entire dataset doesn't fit in memory any more). Point is:

* Wouldn't it be natural to measure the performance benefits of
disc-bound tests in an SSD setup?

... my understanding of Fast gi(n|st) index build is that it is
more or less a challenge to transform a lot of random IO workload
to be more sequential and collapse multiple changes into fewer.

In terms of random IO an SSD can easily be x100 better than rotating
drives and it would be a shame to optimize "against" that world?

--
Jesper

#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jesper Krogh (#24)
Re: WIP: Fast GiST index build

On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh <jesper@krogh.cc> wrote:

* Wouldn't it be natural to measure the performance benefits of
disc-bound tests in an SSD setup?

Sure, it would be great to run performance tests on SSD drives too.
Unfortunately, I don't have a suitable test platform right now.

... my understanding of Fast gi(n|st) index build is that it is
more or less a challenge to transform a lot of random IO workload
to be more sequential and collapse multiple changes into fewer.

The main benefit of the proposed algorithm is that it greatly reduces the
number of IO operations during index build by dealing with a great number of
index tuples simultaneously. It also makes some IO more sequential. I don't
have precise measurements yet, but I believe that the contribution of making
IO more sequential is not very significant.

In terms of random IO an SSD can easily be x100 better than rotating
drives and it would be a shame to optimize "against" that world?

Actually, I'm not sure that IO is the bottleneck of GiST index build on SSD
drives. It seems more likely to me that CPU becomes the bottleneck in this
case and optimizing IO can't give much benefit. But anyway, the value of this
work can be in producing a better index in some cases and in extending SSD
drive lifetime through fewer IO operations.

------
With best regards,
Alexander Korotkov.

#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#25)
Re: WIP: Fast GiST index build

On 25.06.2011 11:23, Alexander Korotkov wrote:

On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh <jesper@krogh.cc> wrote:

* Wouldn't it be natural to measure the performance benefits of
disc-bound tests in an SSD setup?

Sure, it would be great to run performance tests on SSD drives too.
Unfortunately, I don't have a suitable test platform right now.

Anyone have an SSD setup to run some quick tests with this?

In terms of random IO an SSD can easily be x100 better than rotating
drives and it would be a shame to optimize "against" that world?

Actually, I'm not sure that IO is the bottleneck of GiST index build on SSD
drives. It seems more likely to me that CPU becomes the bottleneck in this
case and optimizing IO can't give much benefit. But anyway, the value of this
work can be in producing a better index in some cases and in extending SSD
drive lifetime through fewer IO operations.

Yeah, this patch probably doesn't give much benefit on SSDs, not the
order of magnitude improvements it gives on HDDs anyway. I would expect
there to still be a small gain, however. If you look at the comparison
of CPU times on Alexander's tests, the patch doesn't add that much CPU
overhead: about 5% on the point_ops tests. I/O isn't free on SSDs
either, so I would expect the patch to buy back that 5% increase in CPU
overhead by reduced time spent on I/O even on a SSD.

It's much worse on the gist_trgm_ops test case, so this clearly depends
a lot on the opclass, but even that should be possible to optimize quite
a bit.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#27Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#19)
Re: WIP: Fast GiST index build

I've added information about testing on a real-life dataset to the wiki page.
This dataset has a peculiarity: the data is ordered inside it. In this case
the tradeoff was the inverse of what is expected from the "fast build"
algorithm: the index build takes longer, but index quality is significantly
better. I think the regular index build is fast here because sequential
inserts go into nearby parts of the tree, so the number of actual page reads
and writes is low. The difference in tree quality I can't convincingly
explain yet. I've also run tests with the data of this dataset shuffled. In
this case the results were similar to those for randomly generated data.

------
With best regards,
Alexander Korotkov.

#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#27)
1 attachment(s)
Optimizing box_penalty (Re: WIP: Fast GiST index build)

On 27.06.2011 13:45, Alexander Korotkov wrote:

I've added information about testing on a real-life dataset to the wiki page.
This dataset has a peculiarity: the data is ordered inside it. In this case
the tradeoff was the inverse of what is expected from the "fast build"
algorithm: the index build takes longer, but index quality is significantly
better. I think the regular index build is fast here because sequential
inserts go into nearby parts of the tree, so the number of actual page reads
and writes is low. The difference in tree quality I can't convincingly
explain yet. I've also run tests with the data of this dataset shuffled. In
this case the results were similar to those for randomly generated data.

Hmm, I assume the CPU overhead is coming from the penalty calls in this
case too. There's some low-hanging optimization fruit in
gist_box_penalty(), see attached patch. I tested this with:

CREATE TABLE points (a point);
CREATE INDEX i_points ON points using gist (a);
INSERT INTO points SELECT point(random(), random()) FROM
generate_series(1,1000000);

and running "checkpoint; reindex index i_points;" a few times with and
without the patch. The patch reduced the runtime from about 17.5 s to
15.5 s. oprofile confirms that the time spent in gist_box_penalty() and
rt_box_union() is reduced significantly.

This is all without the fast GiST index build patch, so this is
worthwhile on its own. If the penalty function is called more often, this
becomes even more significant.
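
The shape of the optimized penalty calculation can be sketched outside PostgreSQL. This is only an illustration of the idea in the attached patch (write the union into a caller-supplied box on the stack instead of allocating one per call), and the Point and Box types and the box_union, box_size and box_penalty functions are simplified stand-ins, not the real definitions from gistproc.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
	double		x;
	double		y;
} Point;

typedef struct
{
	Point		high;			/* upper right corner */
	Point		low;			/* lower left corner */
} Box;

/* Union of a and b, stored in *n; no memory is allocated. */
static void
box_union(Box *n, const Box *a, const Box *b)
{
	n->high.x = (a->high.x > b->high.x) ? a->high.x : b->high.x;
	n->high.y = (a->high.y > b->high.y) ? a->high.y : b->high.y;
	n->low.x = (a->low.x < b->low.x) ? a->low.x : b->low.x;
	n->low.y = (a->low.y < b->low.y) ? a->low.y : b->low.y;
}

/* Area of the box, or 0 for an empty or degenerate box. */
static double
box_size(const Box *box)
{
	assert(box != NULL);
	if (box->high.x <= box->low.x || box->high.y <= box->low.y)
		return 0.0;
	return (box->high.x - box->low.x) * (box->high.y - box->low.y);
}

/* Penalty: how much the area of orig grows when add is merged into it. */
static float
box_penalty(const Box *orig, const Box *add)
{
	Box			unionbox;		/* lives on the stack, as in the patch */

	box_union(&unionbox, orig, add);
	return (float) (box_size(&unionbox) - box_size(orig));
}
```

For a unit square at the origin and a box whose upper right corner is at (2, 2), the union has area 4, so the penalty is 3. A box that already lies inside the original gives a penalty of 0.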

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

optimize-box-penalty-1.patch (text/x-diff)
*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
***************
*** 23,29 ****
  
  static bool gist_box_leaf_consistent(BOX *key, BOX *query,
  						 StrategyNumber strategy);
! static double size_box(Datum dbox);
  static bool rtree_internal_consistent(BOX *key, BOX *query,
  						  StrategyNumber strategy);
  
--- 23,29 ----
  
  static bool gist_box_leaf_consistent(BOX *key, BOX *query,
  						 StrategyNumber strategy);
! static double size_box(BOX *box);
  static bool rtree_internal_consistent(BOX *key, BOX *query,
  						  StrategyNumber strategy);
  
***************
*** 32,63 **** static bool rtree_internal_consistent(BOX *key, BOX *query,
   * Box ops
   **************************************************/
  
! static Datum
! rt_box_union(PG_FUNCTION_ARGS)
  {
- 	BOX		   *a = PG_GETARG_BOX_P(0);
- 	BOX		   *b = PG_GETARG_BOX_P(1);
- 	BOX		   *n;
- 
- 	n = (BOX *) palloc(sizeof(BOX));
- 
  	n->high.x = Max(a->high.x, b->high.x);
  	n->high.y = Max(a->high.y, b->high.y);
  	n->low.x = Min(a->low.x, b->low.x);
  	n->low.y = Min(a->low.y, b->low.y);
- 
- 	PG_RETURN_BOX_P(n);
  }
  
! static Datum
! rt_box_inter(PG_FUNCTION_ARGS)
  {
- 	BOX		   *a = PG_GETARG_BOX_P(0);
- 	BOX		   *b = PG_GETARG_BOX_P(1);
- 	BOX		   *n;
- 
- 	n = (BOX *) palloc(sizeof(BOX));
- 
  	n->high.x = Min(a->high.x, b->high.x);
  	n->high.y = Min(a->high.y, b->high.y);
  	n->low.x = Max(a->low.x, b->low.x);
--- 32,56 ----
   * Box ops
   **************************************************/
  
! /*
!  * Calculates union of two boxes, a and b. The result is stored in *n.
!  */
! static void
! rt_box_union(BOX *n, BOX *a, BOX *b)
  {
  	n->high.x = Max(a->high.x, b->high.x);
  	n->high.y = Max(a->high.y, b->high.y);
  	n->low.x = Min(a->low.x, b->low.x);
  	n->low.y = Min(a->low.y, b->low.y);
  }
  
! /*
!  * Calculates intersection of two boxes, a and b. The result is stored in *n.
!  * Returns false if the boxes don't intersect;
!  */
! static bool
! rt_box_inter(BOX *n, BOX *a, BOX *b)
  {
  	n->high.x = Min(a->high.x, b->high.x);
  	n->high.y = Min(a->high.y, b->high.y);
  	n->low.x = Max(a->low.x, b->low.x);
***************
*** 65,76 **** rt_box_inter(PG_FUNCTION_ARGS)
  
  	if (n->high.x < n->low.x || n->high.y < n->low.y)
  	{
! 		pfree(n);
! 		/* Indicate "no intersection" by returning NULL pointer */
! 		n = NULL;
  	}
! 
! 	PG_RETURN_BOX_P(n);
  }
  
  /*
--- 58,67 ----
  
  	if (n->high.x < n->low.x || n->high.y < n->low.y)
  	{
! 		/* Indicate "no intersection" by returning false */
! 		return false;
  	}
! 	return true;
  }
  
  /*
***************
*** 187,196 **** gist_box_penalty(PG_FUNCTION_ARGS)
  	GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
  	GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
  	float	   *result = (float *) PG_GETARG_POINTER(2);
! 	Datum		ud;
  
! 	ud = DirectFunctionCall2(rt_box_union, origentry->key, newentry->key);
! 	*result = (float) (size_box(ud) - size_box(origentry->key));
  	PG_RETURN_POINTER(result);
  }
  
--- 178,189 ----
  	GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
  	GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
  	float	   *result = (float *) PG_GETARG_POINTER(2);
! 	BOX		   *origbox = DatumGetBoxP(origentry->key);
! 	BOX		   *newbox = DatumGetBoxP(newentry->key);
! 	BOX			unionbox;
  
! 	rt_box_union(&unionbox, origbox, newbox);
! 	*result = (float) (size_box(&unionbox) - size_box(origbox));
  	PG_RETURN_POINTER(result);
  }
  
***************
*** 209,214 **** chooseLR(GIST_SPLITVEC *v,
--- 202,209 ----
  						LRr = *union2;
  			BOX			RLl = *union2,
  						RLr = *union1;
+ 			BOX			LRintersection,
+ 						RLintersection;
  			double		sizeLR,
  						sizeRL;
  
***************
*** 217,224 **** chooseLR(GIST_SPLITVEC *v,
  			adjustBox(&RLl, DatumGetBoxP(v->spl_ldatum));
  			adjustBox(&RLr, DatumGetBoxP(v->spl_rdatum));
  
! 			sizeLR = size_box(DirectFunctionCall2(rt_box_inter, BoxPGetDatum(&LRl), BoxPGetDatum(&LRr)));
! 			sizeRL = size_box(DirectFunctionCall2(rt_box_inter, BoxPGetDatum(&RLl), BoxPGetDatum(&RLr)));
  
  			if (sizeLR > sizeRL)
  				firstToLeft = false;
--- 212,225 ----
  			adjustBox(&RLl, DatumGetBoxP(v->spl_ldatum));
  			adjustBox(&RLr, DatumGetBoxP(v->spl_rdatum));
  
! 			if (rt_box_inter(&LRintersection, &LRl, &LRr))
! 				sizeLR = size_box(&LRintersection);
! 			else
! 				sizeLR = 0.0;
! 			if (rt_box_inter(&RLintersection, &RLl, &RLr))
! 				sizeRL = size_box(&RLintersection);
! 			else
! 				sizeRL = 0.0;
  
  			if (sizeLR > sizeRL)
  				firstToLeft = false;
***************
*** 504,520 **** gist_box_picksplit(PG_FUNCTION_ARGS)
  		direction = 'y';
  	else
  	{
! 		Datum		interLR = DirectFunctionCall2(rt_box_inter,
! 												  BoxPGetDatum(unionL),
! 												  BoxPGetDatum(unionR));
! 		Datum		interBT = DirectFunctionCall2(rt_box_inter,
! 												  BoxPGetDatum(unionB),
! 												  BoxPGetDatum(unionT));
  		double		sizeLR,
  					sizeBT;
  
! 		sizeLR = size_box(interLR);
! 		sizeBT = size_box(interBT);
  
  		if (sizeLR < sizeBT)
  			direction = 'x';
--- 505,523 ----
  		direction = 'y';
  	else
  	{
! 		BOX			interLR;
! 		BOX			interBT;
  		double		sizeLR,
  					sizeBT;
  
! 		if (rt_box_inter(&interLR, unionL, unionR))
! 			sizeLR = size_box(&interLR);
! 		else
! 			sizeLR = 0.0;
! 		if (rt_box_inter(&interBT, unionB, unionT))
! 			sizeBT = size_box(&interBT);
! 		else
! 			sizeBT = 0.0;
  
  		if (sizeLR < sizeBT)
  			direction = 'x';
***************
*** 634,644 **** gist_box_leaf_consistent(BOX *key, BOX *query, StrategyNumber strategy)
  }
  
  static double
! size_box(Datum dbox)
  {
! 	BOX		   *box = DatumGetBoxP(dbox);
! 
! 	if (box == NULL || box->high.x <= box->low.x || box->high.y <= box->low.y)
  		return 0.0;
  	return (box->high.x - box->low.x) * (box->high.y - box->low.y);
  }
--- 637,645 ----
  }
  
  static double
! size_box(BOX *box)
  {
! 	if (box->high.x <= box->low.x || box->high.y <= box->low.y)
  		return 0.0;
  	return (box->high.x - box->low.x) * (box->high.y - box->low.y);
  }
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#27)
Re: WIP: Fast GiST index build

On 27.06.2011 13:45, Alexander Korotkov wrote:

I've added information about testing on a real-life dataset to the wiki page.
This dataset has a peculiarity: the data is ordered inside it. In this case
the tradeoff was the inverse of what is expected from the "fast build"
algorithm: the index build takes longer, but index quality is significantly
better. I think the regular index build is fast here because sequential
inserts go into nearby parts of the tree, so the number of actual page reads
and writes is low. The difference in tree quality I can't convincingly
explain yet. I've also run tests with the data of this dataset shuffled. In
this case the results were similar to those for randomly generated data.

Once again, interesting results.

The penalty function is called whenever a tuple is routed to the next
level down, and the final tree has the same depth with and without the
patch, so I would expect the number of penalty calls to be roughly the
same. But clearly there's something wrong with that logic; can you
explain in layman's terms why the patch adds so many gist penalty calls?
And how many calls does it actually add, can you gather some numbers on
that? Any ideas on how to mitigate that, or do we just have to live with
it? Or maybe use some heuristic to use the existing insertion method
when the patch is not expected to be helpful?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#30Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#29)
Re: WIP: Fast GiST index build

On Mon, Jun 27, 2011 at 6:34 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

The penalty function is called whenever a tuple is routed to the next level
down, and the final tree has the same depth with and without the patch, so I
would expect the number of penalty calls to be roughly the same. But clearly
there's something wrong with that logic; can you explain in layman's terms
why the patch adds so many gist penalty calls? And how many calls does it
actually add, can you gather some numbers on that? Any ideas on how to
mitigate that, or do we just have to live with it? Or maybe use some
heuristic to use the existing insertion method when the patch is not
expected to be helpful?

In short, because many index tuples are routed in parallel, the routing can
change. In the fast build algorithm, index tuples accumulate in node buffers.
When the corresponding node splits, we have to relocate the index tuples from
its buffer. In the original algorithm, they are relocated into the buffers of
the new nodes produced by the split. Even this requires additional penalty
calls.
But to improve index quality I modified the algorithm. With my modification,
an index tuple from the buffer of a split node can also be relocated into
other node buffers of the same parent. This produces more penalty calls.
I don't have an estimate yet, but I'm working on it. Unfortunately, I have no
idea how to mitigate it other than turning off my modification.
A heuristic is possible, but I see the following problems. First, we need to
somehow estimate the length of varlena keys. I avoid this estimate in the fast
algorithm itself by simply assuming the worst case, but I believe we need
something more precise for a good heuristic. Second, the right decision
depends strongly on concurrent load. When there is no concurrent load (as in
my experiments), the fraction of the tree that fits into the effective cache
is a reasonable basis for estimating the benefit of the IO savings. But under
high concurrent load, the part of the cache occupied by the tree should be
considerably smaller than the whole effective cache.
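
Why a wider set of relocation targets costs more penalty calls can be shown with a toy model. The sketch below is not the patch's code: it assumes one-dimensional interval keys, a penalty equal to the interval enlargement, and hypothetical names (Interval, interval_penalty, relocate_buffer). Each buffered key is routed to the candidate node with the smallest penalty, and the function counts one penalty evaluation per key per candidate.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
	double		lo;
	double		hi;
} Interval;

/* Penalty: how much the node's interval must grow to cover key. */
static double
interval_penalty(const Interval *node, double key)
{
	double		lo = (key < node->lo) ? key : node->lo;
	double		hi = (key > node->hi) ? key : node->hi;

	return (hi - lo) - (node->hi - node->lo);
}

/*
 * Route each buffered key to the candidate with the smallest penalty.
 * dest[i] receives the chosen candidate index; the return value is the
 * number of penalty evaluations performed.
 */
static long
relocate_buffer(const double *keys, int nkeys,
				const Interval *cands, int ncands, int *dest)
{
	long		calls = 0;
	int			i;
	int			j;

	assert(keys != NULL && cands != NULL && dest != NULL && ncands > 0);
	for (i = 0; i < nkeys; i++)
	{
		int			best = 0;
		double		best_penalty = interval_penalty(&cands[0], keys[i]);

		calls++;
		for (j = 1; j < ncands; j++)
		{
			double		p = interval_penalty(&cands[j], keys[i]);

			calls++;
			if (p < best_penalty)
			{
				best_penalty = p;
				best = j;
			}
		}
		dest[i] = best;
	}
	return calls;
}
```

In this model, restricting the candidates to the two halves of the split costs 2 evaluations per buffered key, while considering every node under the same parent costs one evaluation per sibling. The wider search can also place a key in a better-fitting node, which matches the quality/CPU tradeoff described above.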

------
With best regards,
Alexander Korotkov.

#31Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#30)
Re: WIP: Fast GiST index build

On Mon, Jun 27, 2011 at 10:32 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I don't have an estimate yet, but I'm working on it.

Now, it seems that I have an estimate.
N - total number of itups
B - avg. number of itups in page
H - height of tree
K - avg. number of itups fitting in node buffer
step - level step of buffers

K = 2 * B^step
avg. number of internal pages with buffers = 2*N/((2*B)^step - 1) (assume
pages to be half-filled)
avg. itups in node buffer = K / 2 (assume node buffers to be half-filled)
Each internal page with buffers can be produced by a split of another
internal page with buffers.
So, number of additional penalty calls = 2*N/((2*B)^step - 1) * K / 2
=(approximately)= 2*N*(1/2)^step
While number of regular penalty calls is H*N

It seems that the fraction of additional penalty calls should decrease as the
level step increases (though I haven't run experiments with level step != 1).
Also, we can try to break the K = 2 * B^step equation. This can increase the
number of IOs, but decrease the number of additional penalty calls and,
probably, increase tree quality in some cases.
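
The estimate above can be evaluated numerically. The following sketch simply encodes the formulas from this message; the function names (ipow, extra_penalty_calls, extra_penalty_calls_approx, regular_penalty_calls) and the sample values of N, B and H are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* base raised to a small non-negative integer power. */
static double
ipow(double base, int exp)
{
	double		result = 1.0;

	assert(exp >= 0);
	while (exp-- > 0)
		result *= base;
	return result;
}

/* Exact form: 2*N/((2*B)^step - 1) buffered pages, K/2 itups in each. */
static double
extra_penalty_calls(double n, double b, int step)
{
	double		k = 2.0 * ipow(b, step);	/* K = 2 * B^step */
	double		buffered_pages = 2.0 * n / (ipow(2.0 * b, step) - 1.0);

	return buffered_pages * (k / 2.0);
}

/* Approximation: 2*N*(1/2)^step. */
static double
extra_penalty_calls_approx(double n, int step)
{
	return 2.0 * n * ipow(0.5, step);
}

/* Regular penalty calls: one per itup per tree level, H*N. */
static double
regular_penalty_calls(double n, int h)
{
	return (double) h * n;
}
```

With N = 1,000,000 and B = 100, step = 1 gives about 1,005,025 additional calls (approximation: 1,000,000), and step = 2 gives about 500,013 (approximation: 500,000). Against 3,000,000 regular calls for an assumed H = 3, the additional fraction falls from roughly a third to roughly a sixth, in line with the expectation that it decreases as the level step grows.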

------
With best regards,
Alexander Korotkov.

#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#31)
1 attachment(s)
Re: WIP: Fast GiST index build

New version of the patch. A bug that caused failures on trees with a large
number of levels has been fixed. Some more comments and refactoring were also
added.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.3.0.patch.gz (application/x-gzip)
���IM���}=��]����H���� =�O�NJ�����]|F�~�[xe:�>����a��yRY�a������������B.���-���	!�i�S�J�Po�(g&$|��`k�����7x����?1���lj���nv���eb�w?Ce��M04_�cPmG@�Z�r�8��_0J�M^n���Zj�nw�Y����12FQ�J��{i���>P�=8~	Q�.J~����IM,s{3u16����r�}�{�����w�F���Y�!���u=X����>�`R��m^Kf@<�-�q0�P�5H�����^�����s�z��O�e�.V7Q�jI�T�bk���V��F�I8����&��D�]��E��������,���,V���w����Y��6XsM"G�W��$��p�&�!
�����������5�z�ce�Ez/�fo��6i<���f��yZ]x����K�� 9�����bx)�(�v���+����V�*�`�Q��d�Tx�F7C��U����@��5��e��������W���Xm���Bx��%�V���}��c�����Kgtz����+og���b�w�Co��Y���]��|�'�U%(4�ape�)�.�Z�pSy������"�4]����G$��������]8"-�
D�>�ze��1����V�A�����%��]l���B����u����qHzQWbi�>Q��X�"��1WvQ&5���/E����W��C5�9��9�;.L2����������D� �����M=8�N�nA�r@�d���JX�����}��(�E�*]1�O�G����x��lr�g�{�3��f���m\�����S��(J3��$O��<HQl�9�����w�����/x����sx��v���Es�g�3^�������!_dZ��y&`�-{'�]��Gy��o�����l�s+'����w�(O�)8��p���6'�b#�D�`�&��F������\-D����s�~�W���O	&���>��8{A�$V�w��S��������Z*j��'�k�q-�'�JX�M�h�Z������6��l��}2�3������E��b��}#v����h*�����%�Dq4��C��C��dhE���,\	�I1	�pD�jr����>LX���O�x.�x�S�I��(=�e�B����R)���9��I�%-q���"9[��P�����rV�kV������
s��`V5���)��T\5���8�Z`��	@��G���%0�"�VTQ�Le��.�`���4~�$$2	T���)�P�mE�')�&;�iZ}.��:h��^�kZ�g����I���b���V����7m3\�8��rt5��}������������]+�
g�J(����w���T����Mv����]F�
+u�(�e
R�}�-��4�^�.�6�D��s?��i"Lt���
g��j_I�C�/j����oC���B�
7^���N��{�Q~����Kxx�-��g(�@{�`2	�]<r����V�z3Es�{�������kG�]B(~��:���:`�%p5"�`��Na�Q�H������Uf5����c*����A�l��t0�
S����P:�]BLA����E��$a#�C��_���!I����O"�{��<��Yt���m�p���3]�!�F����1�����^��)a��GPW�L%�����%��C�������y�X,�9}�������l<�s��eDH�}|��v��g�j;Z\&�g^'���Z5�@ ���y�%f����<y�����*��O�C��[�?��"_���wHc4��#��)��<�z��o$2�~���u����@��1�{ �����	M��N��`����)Q�uaY
�|��(�
p�.tyB��GC�r.������4����;���t�MN-r��v��O���>�V#�/�i-�+�,�PG6r9��E��H��v�����yR��������a�����S�v���.�+������W�.6�+���2��%��t�U�� ��pdcQu4�k��s�Y�qzVl`ZUT&��L-���q��t8�n��ID�B��R�saZ� �����j�0"Z~z�>��S�f�@y��6*U9=��dU��b��fOGn���l�C[��Y�,Y���\|���7��U�nq[R�H���a�n�4�jy��b����6Rp��?-����U��?�`��	6��������.j*��O�=Q��Zb�$�n��;�s����RLw�ci�+�/#��������=�>[N��22t��������y����6H�k��.'`���^=���/���Db��&�	z��q-���W������JIa��;	0���fy�&)�'�����wl^lLn��,����U�b�����KRp��^�8�0����3��k��97���W������b%�6XCG�"1���^�����5#��n�G����	�'�c`������fw�[jkb��>`���D���>�^����`���������H���l���fea�M�R�oG#�ZU��2��fH��Z�14����-��a'�������C�~�+����av_���;�\O����u"v��~gv��V|�7��A�c���8�4I�CF��lt(��V%�e������p����e)�i5T��[k��x%��d���_}���t�d��l�l��{��,IYV�#��
����Ni�C�@:�f��O�����NIC$mc����!����S�B��`6����4��b�&���H�=��/h,�X��+�p1��� �l�'c�Zi�G_����I���N�f������A��`�����.��w}]!d�9&N�+���������s�+�,.����t
����d���b��&�I���%�Ou����?D�	���0`~F��������W���M�j2��I}5� �������������������I��hil�{	0��wq5�W�8�B�H�O�7p��}���\��U��:\�1�9�S6�bv�
�}��J�u3��L�0�@���F���j.�eb`W?{&*�c�W����l[%O�*�A���R�!�v�s(1	1k�q&d�u��]]�Z�-fD��n���f����B����;�[�+d�?��
���!w��>�N���%};�9�-�����|��i�	��lY��(�.���l����u��;OW���O��8����!����2��:�������"+�{�x��W�����o�����S����J \��a�O��-p�BY�OkV���3�4�!��/��}+���y� ��NB��0�X}0�w��^'��#�OS�Y���0�L�Rud��DD]�j��������7�h^�t�
�	�s��#�{^��p�������W+��9��l(�T�W��T�N+c4$��Iz�f-x>������S@��%5�R�|���f �@o���\p��ZQZ-!�g���]���0�
*t�rg�_����KU��>9.��*(�I�<f����hM	�����������p�s��l�%�9���}�j���]o1��li�NFiu�����d&�/�F�����c��G�#���;��/��mQ
�;�U�P
[���d��,���"���h��>��a� 5�e�A�����k�L06b���0f����j~si$�S��=����|�/�}Y������������#����#LH
I��E\����
n`���b/1G1� ��z;�Z*�Sm�n��6��i)tH���BE�a�_�7e��O'A<�w��t���.?�,N&)S
Y�{���+	��Y����M������>5�be<��qm�QQ��[Q��_\��o�t���zS���(w�-�wp�a��T���>]�Ko�G!r&���J�e�%8�TA�J������qm�5���O�%�5��vq���=v53{
\y`jY��U8BK�h�I���gu,����U�_�
��@8mH21�
 n
�������|���p�JZ�v��	ec��mJ{~xpp������������L���8T��n��@�"����K/�����:c�j�U��]-6YrP�%	be5c���{�����A�y������r�!0	dn�G�%�?p��G�9�y�om��)�&��a�
)o�GB��Q���^���B��\bk��r�$��S��x`�@�}"#��P7���	;Q�����Wiy��s�����N\���8>u�h�����_y����W��,�d���{5�A��Lv�[�Z-�hZ�:+����������GW��Um=*P�%����b\M0I���"�l-!]&�	RF��S�]bp�K�J	�fn���T'3w�����I*�v�����B�>���W<;�����O�-�X>?M�tI�3I``����]���ZCb	�P�����	c�`;����|�I��`�>$�Ex��$�{"��[j~*o(�_����pQ��}A���M=@��p�,T�!��%�s!N��;Q&z����k�)���kF�J����C���0!�L��<QR(4Ud:�����0���>��z�x�:�]�T9=�.�XT�'E���eLq���v:�K��������h����B"����KT��3��)���[1�;��k�!������7�����*��6{7���D��Y8��	��Y�]����$�o�)�S�5\��,�b4�P���@5�Vq���
��c��k���N#hQJy��-O�F(���-TLm����t����J�/�xNvY������J=�u��3U���)8[�7�o�gh���VH��!��b���|�I�)��S~b�h�>�:7j7<���k�x0��>�K�syo�V�#�P��\i�PG�o�H��%T4�����%Gg�!%�G3lk��<�jb�r��m0E��,�G��W��1��R(B�jy _�%U�lji\W����.���w�/X���d��v�	�)��Vh"�!����1���:�/n	Z
V�f�b��n�BX^�%��t]��/+E\y��.�t�z���X4�\�
K8N��&�R@}����L��a����=�@x}�������u���0yY��9�� H��������`~n!��t�`j,���_��ygOMh�1Nm&E"��0���������������U�
��H-�.��%xS��:�qJ#���/)L��b6P�Asw�(I��������J��:��������"fU��8������*��G�7�+�.���)���`8�f�� 5��3E�W]3x9w������e9CU��� _B?<�$E��g��i����8������{��m�0��C^n�-�*9�C�m�a��\*[C���wA���Ab����G]��K$3���"�$1��e�T=	K�/i����<j:����E`U�	R?�g��)7���[�M�7���E���o��`c%�-����l�6s�&3��F�n=r��i���4�$w�	3�)C�C3�>��3�/�L�����O)���K���A���;Y�����/s[��"���Ar�m<J�s|��P;~���r���f�����\��b&j��o���P!g���Ng�7.����_��
��q�o�n���Xz��������-<� p5��k��I���[<P��H{��%�5��
A~0��������E�@���7RI�}��Y���=���)ss� ��J�.�0�<4dH��/�����p<��c���j��!�m���b�+�^5��B��w��0�@D��p]�z��*%.;��Ex�w<E�c��&�~����hq���{O��������4B�/�9���S�,T�{��n�'(���-���$d���W�N�DHr���E�i��y�q�u�u���\��<hd���3G�c$>����D� �+��7Y$���=�5]����,�b��V��-�W�+wk!KaW"���b������*���A��4���E��*��]�m@BH���e�W4,�@�{��\
����\z��2.*�R#�U������X�����"��V�-�7���<�����slI��.f�����<��Q�xEo�o�d���'�?y
�wF;��������z���[�Edh����L����d:\I� p��[��=R�W`Nl���\��Y ��L|����=\�9�����DiNeI�@�jLT��,�K����7�������n��z������FC���z!�������������'ui��K�+��+��(������X�&ec��r��-g�v?�9�.�;:<H\<�(��6�^w[�'��IOv���#�V?)�������>�X�L;����x[�kx>�P����������e`�9j�[������Zqs���6De#�h�SuS*G�PO���]�d�\S�w�i=����.��������u�w�����`��s@P6?k����QU�J���rT��%�D��d0�����]k����PV�Q������R}a�����m����jj�����3�z�c���,�����j��mvFZ�����7��J�2��2�oz7�U|��w��+o�"�6w�������P�4���z��T/|w>e�L@�Q�������������n�a��Q��[bB�\9��j��j��N&��lxKw�"�9I����x�j�Ae%{��$:)���fy/����=M4-I���I���d�6O��U{�m�N�^��-�2����A������8�<���`N�`������& y�T;��^�t��T��g�6>]���u�
�N��-5\�<��W��R�L��.�x�Z�����e��"l��FX}���	�)t��?8�}���^���D��i�Pf����_LSfM����������}��q���j�f�S�,r��wd����b���-�	>c�����F����'�?!c@4AMb��J���jC�PQ��Z��M�\��1x�2�[XU�KX�����@�\�s�S�������a�����~����p��#{0��5���(YF����%���&���F�x������R�zCm��4�1�����q�"t����D���Y�.
���^hwFe��,���,{��<��Z�i�LZ����b�Y�������Q����cn`����SD^��J-�5I!��+����x?���a\8}��T�j��>g�.�$(u��y�M�<�U�w���3$�����E�L��0>�9�M3���[���:��L�7{��/n�`M<�M:�6��O�t���o�h2t�8�^��UahO�y�o^�MC"���������'`]�����y�9������;$Hj�	p�0DmIh��.+v���#�g���x�G��_�7���X$�!�	���T����..���g`����-���C:�������������'9�Mm�i�.����AF����0�p29	3B����('j�77R���6s��>WV�d�^���'f���>@��u�I�����
����'�\��a��}���w�x�"�����o_�_��q�{���9|}��W�=Q����� 6���}�u�u�����9~���aUtz%���.���`��i��7�n�����u5�������[��0������[H����w�����!�} ���{�B�����29��;��?��qo��V�n�k��fk�5p0a�q����0���uCU�)��
��o���[Y���+z���W�0���1�i`]�K3��������7m������P�vB�������?�r��9*���}����z��D���`�����0u>AE�d3�9�����[m����L��u�71t�y�C~0����d���J��6�l����p����G��&$G��c�E�� L��E��f��pa�2�����C�������M���m����7����#��KL���2�q��u� �4E�2S���o��~��'D����QY~ �~P�@
���wh�6d��3o�������;v�Qax��Pj���%�bqk�f����m��: �"/��
��=s��^��v��GW�92����o�[E_}�u����U�����h�t9�U��:����.�������Ni<���^����I8.�c��W~[��*�h���y��B5��F����������k�j'4�Km���w���1~�)��WW<P���?H�%T�J����g3x�(�7j�1�����a��O��9q�������������Mu�QF�b��k���(-������v��#R���,z����g���z��#@�pg\��j ������s����v�?���X��{��v����a�w����e�eO,����	���p���(�)U(�B�S�2_Yo�{!��*%�C���"2���o���;z��X-������(+�&�C&mEq�[<^]Q�6W0(�y��z8�yP?������(����"��$*����J�"��2�%��^�-��B=-����"������1�81�DQ��I"��I��wt;�������t����O�c���������{�^M��{�An��Fj|k?O���s���dF���//�z�2c��/c7���5���6'� 2@�\<~��}��;������g����b8�$@m�#CI�J%htbu�(7e������tq�(����2��"���-7X��0����d>~����'�
��W4��>�>QS�.�;��"-j�k�5@�Bu<P�@iI9�)����FtY&��
���/l�"���&��?M�=��|�N-h����pkN�����@��@u�}e��h���N�@p����T�d����+&��Rt~Hf%'cz�)�c�X�g	� ��o�xP���t��*��t�r#k�(�PK�(����B�
R�aV�j�I06�N$�5T�������&�	�^`��-��4�"��r�
��-��Ti��kE%����2����.����K����ssP�����������������x�y�`
�!�K������w'����5�)�u�R��Q�ik,*�$��7���������/�����7�&y��A��j0�a����
���Y������:h�S�5Q���
��ka�	hR��������ut����ft�"���g����m��>I�{��{�s��P�-bq���m5�@���r~>E'�m|�������T�����cb��rO��}�a9��;�Zs\^\�p�)��P��<f��9}���������d=�U�������G�0���*�uy�l��Ky�Z��cKx�L���� �e��E< ��7%�Rg��f�x&�9���"���EkP��Q���t���l�d��+(R���z]�9m��2�nr@�4W�d�����O2�����T���en�>}�����
%}��	��5$4(&���Kx$����D���a�V|���&}����[�������<���0��M��6�^
�!��9�����	pf��������s�R�L4�������!9���A���.��q����|�����fH�^�^���C�e������w�{/�����Z�����6�O! �SU�W	
r\�yk�&�u�*�`�q������ d��Ke;4f
���������l3�?�u�n{6�N�.o��3[�����v~�0�����/�+$>�����va-����
�7fI�
��$:��qUB�N�@�a��}�M��w�y��M<��o~*o�?��������
��7�	��b�c���qO����r�����
�g�fma�&���)u���	�����|_�e��ib�N �������c��`cYby�y%�7��Z�T����hJog?K�%�w��'N���6�oHn�1
���g?���6_,���M2���������^w����`�p^o��<�W�����8��ow6���'����+�%,\q��bw����%����t���,��s����S��!��o~�seC
�s�K4Z�$��|��2����0U��_�
��k	:���
k�: �.m�e=�^k���2�?�D�.�
#33Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#32)
1 attachment(s)
Re: WIP: Fast GiST index build

New version of the patch. It now seems that all the code is covered by comments.
I'm also going to write a README with a general description of the algorithm. Some
bugs were fixed.
More options were added. Their descriptions are below.
1) fastbuild - whether to use the fast build algorithm. Default is true.
2) levelstep - step between levels where buffers exist (if levelstep == 1 then
there are buffers on each internal page, if levelstep == 2 then buffers are
only on odd levels, and so on). By default it is calculated from
maintenance_work_mem and the indexed types.
3) buffersize - size of buffers in pages. By default it is calculated
from levelstep and the indexed types.
4) neighborrelocation - whether to relocate the buffer on split between
neighbor buffers as well (my modification of the original algorithm). Improves tree
quality, but produces additional penalty calls. Default is true.
Varying these options should allow me to understand the tradeoffs better.
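
To make the levelstep description concrete, here is a small Python sketch of
which levels would carry buffers. It is not taken from the patch. The level
numbering (leaf pages at level 0, internal levels counted upwards from 1) and
the function name buffered_levels are assumptions made only for illustration;
the sketch also does not try to model whether the root level is treated
specially.

```python
from typing import List


def buffered_levels(tree_height: int, levelstep: int) -> List[int]:
    """Return the internal levels that would have buffers attached.

    Assumption (not from the patch): leaf pages are level 0 and internal
    levels are numbered 1 .. tree_height - 1 going up towards the root.
    With that numbering, levelstep == 1 selects every internal level and
    levelstep == 2 selects only the odd ones, as the option text describes.
    """
    if levelstep < 1:
        raise ValueError("levelstep must be >= 1")
    return [
        level
        for level in range(1, tree_height)
        if (level - 1) % levelstep == 0
    ]


if __name__ == "__main__":
    # Show how the set of buffered levels thins out as levelstep grows.
    for step in (1, 2, 3):
        print(step, buffered_levels(6, step))
```

A larger levelstep means fewer buffered levels, so fewer buffers compete for
maintenance_work_mem, which is presumably why the default is derived from it.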

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.4.0.patch.gz (application/x-gzip)
� H	��vx�"�.%���EB_w���s	�O����fw�����d���#����y�\���)�BY��������%��Am3j��;+U�o ������_=�<
�6G������K?E�j��*��[5���k`wAdT�&0�\k�����h4���&h�Q�y�_�L}9<)�����<~	wl�I���A�[]���5�>b��D^�����>��	:TZ&W�A_Y�������0��L�����"[�?}�%�C�
�3�g����O}���I�>=+����bx��U�z���o�Rr�:���~E��b�I�Gy������)o�v�<V��&����������=��>q��M��8�^��C�qw��7w{�]v������a�s�g��tD�������I/���h��m�XH��c�Z���������8���:\�2+$d��F��Pd�����/�A5���TV]�4�����������:<z�vg�E��vw�%�yU���Sa�+9�)�J��H;�>q���s���|?D����
0����;}�9dI*�7��nx��8b~��^��1���|y���'��X����xDLt�J )-\+�*to8����P�x0YjO�OB������J(���1���o��?��[C>�-�D���O����M�����W���������>>��ty��������[y8�#~S"�#����#�U8���\>ij�2��F\6�V���0�m�W^�kV	��~��������''���>�����������Pm}�{��-��Nj8=���O����F`�n�E�W���q�-e��M���)j��x'������a�p^o��������.(��ov~v�6�s��u/��e,{i��B�Q���
J:��������W��_�u@>[�T����`N��8n/m�T����������	W�����Vl�+��n%��B3�����]�D��c��bY�e�����%*�
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#20)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I couldn't resist looking into this, and came up with the attached
patch. I tested this with:

CREATE TABLE words (word text);
COPY words FROM '/usr/share/dict/words';
CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

And then ran "REINDEX INDEX i_words" a few times with and without the
patch. Without the patch, reindex takes about 4.7 seconds. With the
patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes
even more important with Alexander's fast GiST build patch, which calls
the penalty function more.

I tested this on HPPA (big-endian, picky about alignment) to verify that
you had that code path right. It's right, but on that platform the
speedup in the "reindex dict/words" test is only about 6%. I'm afraid
that the benefit is a lot more machine-specific than we could wish.

I tweaked the patch a bit further (don't pessimize the boundary case
where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
improve the comment) and attach that version below. This is a little
bit faster but I still wonder if it's worth the price of adding an
obscure dependency on the header size.

I used the attached showsign-debug.patch to verify that the patched
makesign function produces the same results as the existing code. I
haven't tested the big-endian code, however.

That didn't look terribly convincing. I added a direct validation that
old and new code give the same results, a la

  	if (ISARRKEY(newval))
  	{
  		BITVEC		sign;
+ 		BITVEC		osign;

  		makesign(sign, newval);
+ 		origmakesign(osign, newval);
+ 		Assert(memcmp(sign, osign, sizeof(BITVEC)) == 0);

  		if (ISALLTRUE(origval))
  			*penalty = ((float) (SIGLENBIT - sizebitvec(sign))) / (float) (SIGLENBIT + 1);

and ran the regression tests and the dict/words example with that.

regards, tom lane

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#34)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

I wrote:

I tweaked the patch a bit further (don't pessimize the boundary case
where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
improve the comment) and attach that version below. This is a little
bit faster but I still wonder if it's worth the price of adding an
obscure dependency on the header size.

It occurred to me that we could very easily remove that objection by
making the code dynamically detect when it's reached a suitably aligned
trigram. The attached version of the patch does it that way. It seems
to be a percent or so slower than my previous version, but I think that
making the function robust against header changes is probably well worth
that price.
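
For readers who want to see the shape of the idea, here is a minimal
standalone sketch. It is not the attached patch: the names
extract_bytewise, extract_wordwise and is_little_endian are made up for
illustration, endianness is checked at run time rather than with
WORDS_BIGENDIAN, and the word loads go through memcpy to keep the sketch
free of aliasing problems. It walks byte-wise until the pointer is 4-byte
aligned, then pulls four trigrams out of every three words, and finishes
the tail byte-wise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference path: copy each packed 3-byte trigram into a zeroed uint32,
 * in memory order (left-justified), one trigram at a time. */
static void
extract_bytewise(const unsigned char *p, int ntrgm, uint32_t *out)
{
	for (int i = 0; i < ntrgm; i++, p += 3)
	{
		uint32_t	t = 0;

		memcpy(&t, p, 3);
		out[i] = t;
	}
}

/* Hypothetical helper: run-time endianness probe. */
static int
is_little_endian(void)
{
	const uint32_t one = 1;

	return *(const unsigned char *) &one == 1;
}

/* Fast path: same output as extract_bytewise(), but four trigrams are
 * unpacked from three 32-bit words once p is 4-byte aligned. */
static void
extract_wordwise(const unsigned char *p, int ntrgm, uint32_t *out)
{
	const unsigned char *end;
	const int	le = is_little_endian();
	int			i = 0;

	assert(ntrgm >= 0);
	end = p + 3 * (size_t) ntrgm;

	/* Step 1: byte-wise until p reaches a 4-byte boundary (at most 3 steps). */
	while (p < end && ((uintptr_t) p % sizeof(uint32_t)) != 0)
	{
		extract_bytewise(p, 1, &out[i++]);
		p += 3;
	}

	/* Step 2: while four whole trigrams remain, unpack AAAB BBCC CDDD. */
	while (end - p >= 12)
	{
		uint32_t	w1, w2, w3;

		memcpy(&w1, p, 4);
		memcpy(&w2, p + 4, 4);
		memcpy(&w3, p + 8, 4);

		if (le)
		{
			out[i] = w1 & 0x00FFFFFF;
			out[i + 1] = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
			out[i + 2] = (w2 >> 16) | ((w3 & 0x000000FF) << 16);
			out[i + 3] = w3 >> 8;
		}
		else
		{
			out[i] = w1 & 0xFFFFFF00;
			out[i + 1] = (w1 << 24) | ((w2 & 0xFFFF0000) >> 8);
			out[i + 2] = ((w2 & 0x0000FFFF) << 16) | ((w3 & 0xFF000000) >> 16);
			out[i + 3] = w3 << 8;
		}
		i += 4;
		p += 12;
	}

	/* Step 3: the remaining 0-3 trigrams, the slow way. */
	extract_bytewise(p, (int) ((end - p) / 3), &out[i]);
}
```

Because step 1 tests the pointer itself, nothing in the sketch depends on
how many header bytes precede the first trigram.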

BTW, I also tried wrapping the first two loops in an "if (len > 4)"
test, reasoning that the added complexity is useless unless the main
loop will be able to iterate at least once, and surely most words are
less than 15 bytes long. While this did indeed make back the extra
percent on my HPPA box, it made things a percent or so slower yet on my
Intel box with gcc 4.4.5. I think the compiler must be getting confused
about what to optimize. So I left that out of this version of the
patch, but perhaps it deserves closer investigation.

regards, tom lane

#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#35)
1 attachment(s)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On 02.07.2011 21:07, Tom Lane wrote:

I wrote:

I tweaked the patch a bit further (don't pessimize the boundary case
where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
improve the comment) and attach that version below. This is a little
bit faster but I still wonder if it's worth the price of adding an
obscure dependency on the header size.

It occurred to me that we could very easily remove that objection by
making the code dynamically detect when it's reached a suitably aligned
trigram. The attached version of the patch does it that way. It seems
to be a percent or so slower than my previous version, but I think that
making the function robust against header changes is probably well worth
that price.

Ah, that opens the door to do something I considered earlier but
rejected because of alignment: instead of three 32-bit word fetches, we
could fetch one 64-bit word and one 32-bit word. Might shave a few more
cycles...
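
A rough sketch of what that two-load variant could look like, for one
12-byte group of four trigrams. This is untested against the real code
and only illustrative: unpack4_two_loads is a made-up name, endianness is
probed at run time, and the loads use memcpy so the sketch does not rely
on the group being 8-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unpack four packed 3-byte trigrams (12 bytes at p) using one 64-bit
 * load and one 32-bit load.  Each result holds the trigram's bytes in
 * memory order, left-justified, like a 3-byte copy into a zeroed uint32. */
static void
unpack4_two_loads(const unsigned char *p, uint32_t trg[4])
{
	const uint32_t probe = 1;
	const int	little_endian = (*(const unsigned char *) &probe == 1);
	uint64_t	q;				/* bytes 0..7:  AAABBBCC */
	uint32_t	w;				/* bytes 8..11: CDDD */

	assert(p != NULL);
	memcpy(&q, p, sizeof(q));
	memcpy(&w, p + 8, sizeof(w));

	if (little_endian)
	{
		trg[0] = (uint32_t) (q & 0x00FFFFFF);
		trg[1] = (uint32_t) ((q >> 24) & 0x00FFFFFF);
		trg[2] = (uint32_t) (q >> 48) | ((w & 0x000000FF) << 16);
		trg[3] = w >> 8;
	}
	else
	{
		trg[0] = (uint32_t) (q >> 32) & 0xFFFFFF00;
		trg[1] = (uint32_t) (q >> 8) & 0xFFFFFF00;
		trg[2] = (uint32_t) ((q & 0xFFFF) << 16) | ((w >> 16) & 0x0000FF00);
		trg[3] = w << 8;
	}
}
```

Whether this beats three 32-bit loads would have to be measured per
platform; the sketch only shows that the bit shuffling works out.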

Meanwhile, I experimented with optimizing the other part of the loop:
the HASH() macros for setting the right bits in the signature. With the
default compile-time settings, the signature array is 95 bits.
Currently, it's stored in a BITVEC, which is a byte array, but we could
store it in two 64-bit integers, which makes it possible to write SETBIT
differently. I experimented with a few approaches; the first was essentially
this:

+ /* Set the nth bit in the signature, in s1 or s2 */
+ #define HASH_S(h) \
+ 	do {									\
+ 		unsigned int n = HASHVAL(h);		\
+ 		if (((uint64) (n)) < 64)			\
+ 			s1 |= (uint64) 1<<(n);			\
+ 		else								\
+ 			s2 |= (uint64) 1<<((n) - 64);	\
+ 	} while(0)

That was a bit faster on my x64 laptop, but slightly slower on my ia64
HP-UX box. My second try was to use lookup tables (patch attached). That
was faster still on x64, and a small win on the ia64 box too. I'm not sure
it's worth the added code complexity, though.
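
To make the two representations concrete, here is a small standalone
sketch that sets bits in a byte-array signature, in two 64-bit integers
with a branch, and in two 64-bit integers via lookup tables, and then
checks that all three agree. It is only an illustration under stated
assumptions: a 12-byte signature (SIG_BYTES), tables filled at run time
by a made-up init_tables() instead of static initializers, and a made-up
store_2int() that writes the integers back byte by byte in little-endian
order so the result does not depend on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIG_BYTES 12			/* assumed signature size: 96 bits */

/* Byte-array form: bit n lives in byte n/8, at bit position n%8. */
static void
setbit_bytes(unsigned char *sig, unsigned n)
{
	assert(n < SIG_BYTES * 8);
	sig[n / 8] |= (unsigned char) (1u << (n % 8));
}

/* Two-integer form with a branch, in the spirit of HASH_S(). */
static void
setbit_2int(uint64_t *low, uint64_t *high, unsigned n)
{
	assert(n < 128);
	if (n < 64)
		*low |= UINT64_C(1) << n;
	else
		*high |= UINT64_C(1) << (n - 64);
}

/* Lookup-table form: no branch, one entry per bit for each half. */
static uint64_t low_tbl[128];
static uint64_t high_tbl[128];

static void
init_tables(void)
{
	for (unsigned n = 0; n < 128; n++)
	{
		low_tbl[n] = (n < 64) ? UINT64_C(1) << n : 0;
		high_tbl[n] = (n >= 64) ? UINT64_C(1) << (n - 64) : 0;
	}
}

static void
setbit_tbl(uint64_t *low, uint64_t *high, unsigned n)
{
	assert(n < 128);
	*low |= low_tbl[n];
	*high |= high_tbl[n];
}

/* Write the two integers into the byte array, least significant byte
 * first, copying only SIG_BYTES bytes in total. */
static void
store_2int(unsigned char *sig, uint64_t low, uint64_t high)
{
	for (int i = 0; i < SIG_BYTES; i++)
	{
		uint64_t	word = (i < 8) ? low : high;

		sig[i] = (unsigned char) (word >> (8 * (i % 8)));
	}
}
```

Note that the sketch writes back exactly SIG_BYTES bytes, so only the low
32 bits of the second integer ever reach the signature.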

Here's a summary of the timings I got with different versions:

ia64 HP-UX (anole):

unpatched: 11194.038 ms
fast_makesign-tom: 10064.980 ms
fast_makesign-2int: 10649.726 ms
fast_makesign-tbl: 9951.547 ms

x64 laptop:

unpatched: 4595.209 ms
fast_makesign-tom: 3346.548 ms
fast_makesign-2int: 3102.874 ms
fast_makesign-tbl: 2997.854 ms

I used the same "REINDEX INDEX i_words" test I used earlier, repeated
each run a couple of times, and took the lowest number.
fast_makesign-tom is the first patch you posted; I haven't tested your
latest one. fast_makesign-2int is with the HASH_S() macro above, and
fast_makesign-tbl is with the attached patch.
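
To put the timings above in relative terms, a trivial helper (the name
percent_saved is made up for this note) computes the share of the
unpatched time that each variant saves. Fed the numbers above, it gives
roughly 10%, 5% and 11% on the ia64 box and roughly 27%, 32% and 35% on
the x64 laptop for the tom, 2int and tbl variants respectively.

```c
#include <assert.h>

/* Percentage of the baseline run time saved by a faster variant. */
static double
percent_saved(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return 100.0 * (before_ms - after_ms) / before_ms;
}
```

For example, percent_saved(4595.209, 2997.854) is the x64 figure for the
lookup-table patch.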

PS. In case you wonder why the HP-UX box is so much slower than my
laptop: this box isn't really meant for performance testing. It's just
something I happen to have access to; I think it's a virtual machine of
some sort. The numbers are very repeatable, however.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fast_makesign-tbl.patchtext/x-diff; name=fast_makesign-tbl.patchDownload
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***************
*** 82,88 **** gtrgm_out(PG_FUNCTION_ARGS)
  }
  
  static void
! makesign(BITVECP sign, TRGM *a)
  {
  	int4		k,
  				len = ARRNELEM(a);
--- 82,88 ----
  }
  
  static void
! origmakesign(BITVECP sign, TRGM *a)
  {
  	int4		k,
  				len = ARRNELEM(a);
***************
*** 98,103 **** makesign(BITVECP sign, TRGM *a)
--- 98,316 ----
  	}
  }
  
+ /*
+  * Lookup tables for setting nth bit in a 128-bit bit array, where the
+  * bit array is stored in two 64-bit integers. That representation is used
+  * inside makesign(), elsewhere the signature is stored in a BITVEC, ie.
+  * a byte array, in little-endian order. To be compatible with the byte
+  * array representation, the 64-bit integers are in little-endian order
+  * regardless of the native endianness of the platform.
+  */
+ #ifdef WORDS_BIGENDIAN
+ #define SET_NTH_BIT_LOOKUP						\
+ 	1LL<<0x38, 1LL<<0x39, 1LL<<0x3A, 1LL<<0x3B, \
+ 	1LL<<0x3C, 1LL<<0x3D, 1LL<<0x3E, 1LL<<0x3F, \
+ 												\
+ 	1LL<<0x30, 1LL<<0x31, 1LL<<0x32, 1LL<<0x33, \
+ 	1LL<<0x34, 1LL<<0x35, 1LL<<0x36, 1LL<<0x37, \
+ 												\
+ 	1LL<<0x28, 1LL<<0x29, 1LL<<0x2A, 1LL<<0x2B, \
+ 	1LL<<0x2C, 1LL<<0x2D, 1LL<<0x2E, 1LL<<0x2F, \
+ 												\
+ 	1LL<<0x20, 1LL<<0x21, 1LL<<0x22, 1LL<<0x23, \
+ 	1LL<<0x24, 1LL<<0x25, 1LL<<0x26, 1LL<<0x27, \
+ 												\
+ 	1LL<<0x18, 1LL<<0x19, 1LL<<0x1A, 1LL<<0x1B, \
+ 	1LL<<0x1C, 1LL<<0x1D, 1LL<<0x1E, 1LL<<0x1F, \
+ 												\
+ 	1LL<<0x10, 1LL<<0x11, 1LL<<0x12, 1LL<<0x13, \
+ 	1LL<<0x14, 1LL<<0x15, 1LL<<0x16, 1LL<<0x17, \
+ 												\
+ 	1LL<<0x08, 1LL<<0x09, 1LL<<0x0A, 1LL<<0x0B, \
+ 	1LL<<0x0C, 1LL<<0x0D, 1LL<<0x0E, 1LL<<0x0F, \
+ 												\
+ 	1LL<<0x00, 1LL<<0x01, 1LL<<0x02, 1LL<<0x03, \
+ 	1LL<<0x04, 1LL<<0x05, 1LL<<0x06, 1LL<<0x07
+ #else
+ #define SET_NTH_BIT_LOOKUP						\
+ 	1LL<<0x00, 1LL<<0x01, 1LL<<0x02, 1LL<<0x03, \
+ 	1LL<<0x04, 1LL<<0x05, 1LL<<0x06, 1LL<<0x07, \
+ 												\
+ 	1LL<<0x08, 1LL<<0x09, 1LL<<0x0A, 1LL<<0x0B, \
+ 	1LL<<0x0C, 1LL<<0x0D, 1LL<<0x0E, 1LL<<0x0F, \
+ 												\
+ 	1LL<<0x10, 1LL<<0x11, 1LL<<0x12, 1LL<<0x13, \
+ 	1LL<<0x14, 1LL<<0x15, 1LL<<0x16, 1LL<<0x17, \
+ 												\
+ 	1LL<<0x18, 1LL<<0x19, 1LL<<0x1A, 1LL<<0x1B, \
+ 	1LL<<0x1C, 1LL<<0x1D, 1LL<<0x1E, 1LL<<0x1F, \
+ 												\
+ 	1LL<<0x20, 1LL<<0x21, 1LL<<0x22, 1LL<<0x23, \
+ 	1LL<<0x24, 1LL<<0x25, 1LL<<0x26, 1LL<<0x27, \
+ 												\
+ 	1LL<<0x28, 1LL<<0x29, 1LL<<0x2A, 1LL<<0x2B, \
+ 	1LL<<0x2C, 1LL<<0x2D, 1LL<<0x2E, 1LL<<0x2F, \
+ 												\
+ 	1LL<<0x30, 1LL<<0x31, 1LL<<0x32, 1LL<<0x33, \
+ 	1LL<<0x34, 1LL<<0x35, 1LL<<0x36, 1LL<<0x37, \
+ 												\
+ 	1LL<<0x38, 1LL<<0x39, 1LL<<0x3A, 1LL<<0x3B, \
+ 	1LL<<0x3C, 1LL<<0x3D, 1LL<<0x3E, 1LL<<0x3F
+ #endif
+ 
+ /* Lookup tables for setting the low and high 64-bits of the signature. */
+ static const uint64 lowsbox[128] =
+ {
+ 	/* 0-63 */
+ 	SET_NTH_BIT_LOOKUP,
+ 
+ 	/* 64-SIGLENBIT are zeros */
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0
+ };
+ 
+ static const uint64 highsbox[128] =
+ {
+ 	/* 0-63 are zeros. */
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 	0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,  0, 0, 0, 0,
+ 
+ 	/* 64-SIGLENBIT */
+ 	SET_NTH_BIT_LOOKUP,
+ };
+ 
+ /* Set the nth bit in the signature, stored in two uint64s. */
+ #define SETBIT_S(high, low, hash)		\
+ 	do {								\
+ 		unsigned int n = (hash);		\
+ 		high |= highsbox[n];			\
+ 		low |= lowsbox[n];				\
+ 	} while(0)
+ 
+ static void
+ makesign(BITVECP sign, TRGM *a)
+ {
+ 	int4		len = ARRNELEM(a);
+ 	trgm	   *ptr = GETARR(a);
+ 	char	   *p;
+ 	char	   *endptr;
+ 	uint32		w1,
+ 				w2,
+ 				w3;
+ 	uint32		trg0 = 0,
+ 				trg1,
+ 				trg2,
+ 				trg3,
+ 				trg4;
+ 	uint32	   *p32;
+ 
+ 	/*
+ 	 * highsig and lowsig contain the signature we're calculating. Together
+ 	 * they form one bit-array of 128 bits.
+ 	 */
+ 	uint64		highsig = 0;
+ 	uint64		lowsig = 0;
+ 
+ 	SETBIT_S(highsig, lowsig, SIGLENBIT);	/* set last unused bit */
+ 
+ 	if (len <= 0)
+ 		goto end;
+ 
+ 	/*----------
+ 	 * We have to extract each trigram into a uint32, and calculate the HASH.
+ 	 * This would be a lot easier if the trigrams were aligned on 4-byte
+ 	 * boundaries, but they're not.  The simple way would be to copy each
+ 	 * trigram byte-by-byte, but that is quite slow, and this function is a
+ 	 * hotspot in penalty calculations.
+ 	 *
+ 	 * The first trigram in the array doesn't begin at a 4-byte boundary, as
+ 	 * the flags byte comes first; but the next one does.  So we fetch the
+ 	 * first trigram as a special case, and after that each four trigrams fall
+ 	 * onto 4-byte words like this:
+ 	 *
+ 	 *  w1   w2   w3
+ 	 * AAAB BBCC CDDD
+ 	 *
+ 	 * As long as there's at least four trigrams left to process, we fetch
+ 	 * the next three words and extract the trigrams from them with bit
+ 	 * operations, per the above diagram.  The last few trigrams are handled
+ 	 * one at a time with byte-by-byte fetching.
+ 	 *
+ 	 * Note that this code yields different results on big-endian and
+ 	 * little-endian machines, because the bytes of each trigram are loaded
+ 	 * into a uint32 in memory order and left-justified.  That's probably
+ 	 * undesirable, but changing this behavior would break existing indexes.
+ 	 *----------
+ 	 */
+ 	endptr = (char *) (ptr + len);
+ 	p32 = (uint32 *) (((char *) ptr) - 1);
+ 
+ 	/* Fetch and extract the initial word */
+ 	w1 = *(p32++);
+ #ifdef WORDS_BIGENDIAN
+ 	trg1 = w1 << 8;
+ #else
+ 	trg1 = w1 >> 8;
+ #endif
+ 	SETBIT_S(highsig, lowsig, HASHVAL(trg1));
+ 
+ 	while ((char *) p32 <= endptr - 3 * sizeof(uint32))
+ 	{
+ 		w1 = *(p32++);
+ 		w2 = *(p32++);
+ 		w3 = *(p32++);
+ 
+ #ifdef WORDS_BIGENDIAN
+ 		trg1 = w1 & 0xFFFFFF00;
+ 		trg2 = (w1 << 24) | ((w2 & 0xFFFF0000) >> 8);
+ 		trg3 = ((w2 & 0x0000FFFF) << 16) | ((w3 & 0xFF000000) >> 16);
+ 		trg4 = w3 << 8;
+ #else
+ 		trg1 = w1 & 0x00FFFFFF;
+ 		trg2 = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
+ 		trg3 = ((w2 & 0xFFFF0000) >> 16) | ((w3 & 0x000000FF) << 16);
+ 		trg4 = w3 >> 8;
+ #endif
+ 
+ 		SETBIT_S(highsig, lowsig, HASHVAL(trg1));
+ 		SETBIT_S(highsig, lowsig, HASHVAL(trg2));
+ 		SETBIT_S(highsig, lowsig, HASHVAL(trg3));
+ 		SETBIT_S(highsig, lowsig, HASHVAL(trg4));
+ 	}
+ 
+ 	/* Handle the remaining 0-3 trigrams the slow way */
+ 	p = (char *) p32;
+ 	while (p < endptr)
+ 	{
+ 		CPTRGM(((char *) &trg0), p);
+ 		SETBIT_S(highsig, lowsig, HASHVAL(trg0));
+ 		p += 3;
+ 	}
+ 
+ end:
+ 	memcpy((char  *) sign, &lowsig, sizeof(uint64));
+ 	memcpy(((char *) sign) + sizeof(uint64), &highsig, sizeof(uint64));
+ 
+ #ifdef VERIFY_SIG
+ 	{
+ 		BITVEC osign;
+ 		origmakesign(osign, a);
+ 
+ 		if (memcmp(sign, osign, sizeof(BITVEC)) != 0)
+ 		{
+ 			elog(LOG, "mismatch: %lx %lx", highsig, lowsig);
+ 			elog(LOG, "orig: %lx %lx", ((uint64 *)osign)[0], ((uint64 *)osign)[1]);
+ 		}
+ 	}
+ #endif
+ }
+ 
  Datum
  gtrgm_compress(PG_FUNCTION_ARGS)
  {
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#36)
Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Ah, that opens the door to do something I considered earlier but
rejected because of alignment: instead of three 32-bit word fetches, we
could fetch one 64-bit word and one 32-bit word. Might shave a few more
cycles...

Hm ... I suspect that might be a small win on natively-64-bit machines,
but almost certainly a loss on 32-bitters.

Meanwhile, I experimented with optimizing the other part of the loop:
the HASH() macros for setting the right bits in the signature.

Yeah, I was eyeing that too, but I'm hesitant to hard-wire assumptions
about the size of the signature.

regards, tom lane

#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#33)
1 attachment(s)
Re: WIP: Fast GiST index build

New version of the patch, with a README and some bug fixes.
Some new tests with variations of the fast build parameters are on the wiki page.
However, I'm unsure how to interpret some of the test results, because they look
strange to me. In particular, I can't explain why decreasing the buffer size affects
index quality so much (I expected that decreasing the buffer size would bring
all build parameters closer to those of a regular build). I'm going to recheck my
experiments; I'm probably missing something.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.5.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.5.0.patch.gzDownload
f���V���*K�J����'���N��do�zBd#��M��i�a����iq
9�)G�j��n���������h�������:����
]���T����B� ];1��r�
GN
M?EJ���I_��.Tb[��IJ��	%�!O�T^f��� �ua����y��(�N>	��a����;{r�4S!)!j�q��}oZ��	mI���k����"XYT8#)��\��SJ8Yp��:"�kP`-�Q���J� ��L�54�JrnK#�Q�`�u �����H�Vc���.n8} "j}%�;����BP�O�xx�B���WW����>��V�in��.����h6���C���9����Q����h/So�h�!0|Y���E8s����	���h����a`��4P"�?x�t������s*�]eQ����w�c���q�:[0#�L�(���qc\jD�C�K�)��~�a�j�+�3-~!t�	I"P����X��@����A[������s�o_����(0�^������U \$���[Dt|�Am�2����m���!M���"p��|-�9}�[v�@]�+�A���#B���Z��[�>7���eby6�u�>��U���e�%f����<y����W��O�C��-r��kQ]���wHm4���X���bP
nZ=��7��D����������V����1��'���[��	M��N�����/:�R�
���e5�d�]���7�I��U3"�����L���Z�u�@x�T���j|F�:���C��Z?��s���[�DuOkX���d��6*#�1?^������l'�����m��H��^>���9����W����>��%xe��k�t�&��fq�wwTf>�����nu���d��l�ET�-�Bi������<���}P��W�����G�:�O�z($�
�	���
gi��0������=� ������e(8�OIZ1�#@d/(������#�����RL_}�����QE��?����dyV4K�o����e�M���l;�++��E��a�v�4���O��J=�z�F
�����4����V�t3TN�i-t��BO�����x7=���!fK�������'�jOH)�������U�/#���@�����>[��e���Z��yy��_�u����������`[RH���T]_"�}s�Gx��N�#��k���j�:���S{K
{V����*%�`V��%��������[}m��n�������M��C��K��b��v�l�L�{ZY�P��l��U�GS�����k�$���D4�_V�i��="y�
N8�e$��[pl�e~4�Han�����7����K��X����y��}����O]�|�W>�������%2������>�j���jU�����Z e�kq��b���"���Z���nX>�qZX���Z�t�#�����Ygz��,oh�]���{�3���=���
c-���Y�2��d`��{�]U�p�L��������w���E?���Kq��r�$��
�#�W*7"�A�\V}e��{>�,I�*�G��KoJ�R��@:�f��O{������H����!��C��'�������I�ai��j	L�����{@�/h,"�S���V(�'����V�\I�+9�:�(t�����o3fte���}����kX��]��v�Y�g�}`g���s������x�����W
���P��QO	��i�9�}�fc�����w�{�e2E<��RR<��3�F1���u>rx�����|T��dB9��J��+��M���0Zj���Kc?H�	8���pV����"
>�� ���L
�Kpc�6)9��1^sP�b*f9m�g�����Y�,�K^��u���������~.TTN����=��J��Ul��#��yH��`
GLB��c��n��_�����"~u��[������K�)�����{�[�#d���`�����NM�����p����F3���%qx�4�o`8���@<������$�����|a�mwsq�i>*�~
��1��������a/�����XM+�"�3�#�8<M3l,H�A���*vA�������;�d|��:`���p\��Rh��a�����~Y��[����s�x�Ex�������T�(���4}Y|�G�4�������Q�i	1Q������7�h^�t���*������Qx�M�����$��Z�#���`C�������}���w�;FM���'a�a���������xH�Z��f\����E�������;z�/�)���/��^B�/ 'C�R��`N��,������.�� %��9.��*(�I��j���Dmu����������H�p�3��l�%�9���}�dnU��*�[L>Y"c���B�b����/����$hs�)8���KO8nE�m����'�Tn�E������B=�0l(���Ub�t
k���'�9^���F����#�#D
����d`�Xsd����@
�a03E�T����y\p���������#����C��f�������l
[aBJHJ�,�������w�V{�9���!���h�`OEN
$��5��i(t���BE�a�_�7e���GA<�w��t��,?-�j)S
)�;��KD+���i;���M������>5�le<-�qi�Qq��SQ��]O����|��y~S���(w��wp�a��T���.��;�(D�$iJ!��fm�I�P��g��(R��1���f?+��e��
���:K���Zf�����"�.RZ H�2e�z���L�|Ie�EE�����%3�b3@	�I���l���l�<���~�����G��V��P$n������LO��Y2�$-Io%�3���G8AJ^ �����.��?9>>��8���X_
����?���.��Bv2���@�?
�pN�x+�d/�^��<��7�y�Q����x��t��{�NL�����M��Qw-�6����|��L.�`�qG�U9��%W���!_6;''�'��Q��*�N������;�(��G��}���s�G�	v�,g1���\���������U���4�������2��������i8�������i����H#��P����	;W�E�����	J��!���
�����I���G���D�9���UY@���K�jLAR��|�!��Z��8/uG1Lbj�1��m[X�p!B-I��� ?9����b�l+!~_E��Q����Vu���,����N�a��fR��8��uLR��4�:K�S����h�Z^�������z�m����3�$JW`�p��m�����^7� �Rq�L��A.#�&<:)�=������}H^��
��)B�=��!��_�?F��lzr
�����yqk�5�=l6�wAR����&�B���w�~����[H�%4'�Aj��#[F�J����i���0��L���()��*6�Z������I���f�:Q�c�|N�r�^)=�&KqDA��,c$BV��N�{����q�
�*p�^�${�~#�������r�������a�&n�!h�$�����|���2L��V��lL_]�k�D�G�Lg�@nH�cN��S��k�7�,�b��P�^?��A�D��|����.�5tpQ[	D�Z�������*t�P����rM��������J��L<BM���,
���e��6��L1�d
������zCn���p���x�������}�	N�DH4�������Q���o�����#����|i
8���J�����SU�������J�����7�=����7�v�-\!���Vb-��Yp!ovkg�Dm'Kk�h���&[L���R�Z,vI5!D-��'��Cz'Yo�2 [[�
Q4<�
��?��*\.���\V1G1���dS-�Y�%��P�)�������b�<�L�;V���l���!��*��r	�2��x3������~�A�������� �>{����y(�b"g��J��%n��!���N����7�� Jf:B�9����_^�egO�j�1N'E"���7�[����]�Q��u
���o���
���G	B�����b��H���b���@�a���Z{���#�U��#j1�]�P�s$Le����G���Y�r<"�"R7�"�t����RO����	6����li	��)��`��X��r�^��{�<L�)�0s��z����[��	p��]�Q�^�cP���������tp��B��Cz���#��y���<R�,��Ke����������0�����*�m�t1x �1cL�-���ZU�q�4�'a	(�{	V��*�!���M����
���#ArHJZ��3r}������&����g��7Jx���-�K��"m�"M&�R
mD.A}��$�s�!������&���PoZ����R����(�U(�)eqTx���U ���S%f�B����U$�0I2,������x���5�nv7�y�"�K���m�y
4�n���<���[�6tn�P?-�����NxU!�����S@o�����+/��������(#���}���������r-�
pC�f�*����&�����Z�F��40i�f%UP
����W�-�2��H#��g��<tw>NQh�T�Bh�����V�X
p�y��s������{�Ty�g�y�[a��5����M�tz��}Ht����P���u��!C�i�����vU��$�x���:w�����5q�������{�������/l9�*�S�,T�{���$C(����)�U�t�?���\�7�6D�I��P���q�
�u���8��~&(��8+�g������!b&�����2�z�%�.���g��B��&�Rx��u���_H�7��B���DZP���������(3�VzQ�����n1�ha�'j�5��
$������~E�z��7���@��������,���)s��Ep�q����IA���T�E��/&��Z$�$�?/��T��(!���j���yE[K�����C�	��G���4�L��aC�@1��3Tlb��p�����%l(�U��wx���������&�8>���"� ���P�y��	�A !J�v���{`�l�+z@���J����:D��=\���>���S>�9q���A*7dl���num�^C����"b��=-�-)!�����+l�,Q����u����$�_��^R���	2H] $�#������0T���g�F:�.l��8�.� oq�s��
�����e�������~�H����)x�����R�*�(��	��N;�������Y�F��9�h��q�/�����M���:����L�m�����7�V0���W���;7��"v�x ��`�`'c-���*"��A0�):��ts���Fc�M>^�����y,O�l��|B=���n2��)�^(���-���������:��M7��{��{�]�������K�����s��Fc�4�
z'�Y�I>�
g�z��k���y�
e�9E��-(��X"S�o���%�a���"�������g���3�����-yXG�F�oo7�3���L�*x�%��,lp����Wqu���-"g�Y<�]�d+���u��i���z�HT/|W�e�?@�q�����O�����%7}Q�h�A�B#,������b�L+�%3�p���;�6�v����<t�����>�AR�8���c&�,�#��'Q��Z���JN@vM=O<t��f������`,���B��u�~I�+��$��'l6�;����m[�<�#����ZU0�����C��f}y�s�y��yQ�|��.�9tLk��'�=)<�6�N@|Zb����,��>����v=`�����y���iA��gK_R�W�C����24��,)��R/IK}iZ�=��
���;W!�J4A�k�1[F������@�Pj�2{eS
��
X����KJ����@�	<Y8�0���$�T�����d:�\2�
@��������sz���37��q<�a-Z���+���v�<�y�X�c�UM��,C~OpH^��1�.y�E/B�6���l���xZ\#?q�z�����	���6((�Xi�!��aO�{��k�������)�^�C������+����T���K��Z.��7��*d����h�}0�
�>B�LK�R��h��iod�1!	�t�VV�H
�7�
���O�M�_�����B;�F��I�x���{�C2�B]�p����n����>�+(��|��i=`�E�N����7�;�Qx�gj1PQ�05 k�uA����q~
lK��+��3���7��}�A���E�Z��[R1:�@,U;[�����|����$}��M\~|�� �5I�^�ue��Y<�>^���%qQ��j�!;0�-���(��Y�����@e���g�5�c\���t��x��I��&�����9p�uQ�����n�F�1���R��U�����������~~6]"!�0|O��p��������N�?w��{��|����X�D�S�S �^��>�Rq��V�@�b�YT����|R��Is~��A�"tk�0{1/o1J*��d��_I4nD���X}������=�������u.�{�����C<CY��Y�������M�y{�O���������T�(�|mCk�Q�j�6�0]b��M�Bs2���U7�i����aF=�b��2�g>+���kkk��2W�����l��~������'�|����������Qw���i�Afh"�/�}M26������3S��r������T&�Z���W]����as�{h����B�M�r>�A]xN�	KA_]���������I����Y�G�|�������As-��a:r�9{wr�}����^�������-��&��	Z�`c�|��}��������������{�[�7t�_Y��+z����k]��	v��4������Y��WP���6����E�o8�#;�g�x����9H�l��v���>���K#�/����V��j����P�o����;T��{U�	O'�0�G����K���
}�w��+d�x|'��n����#$(x��y�F�R?k�,�YI
�|����cf���3�+�E"����;����o�pV5���;�q����-�<}���-lPf��7]3�;�����i�����%) �M�Q`?��h��k8��!���%@a��uE�����g���yB���i�������
�`!��b����,�b2�N;����3��V3s}����0���D�-�� �-Z7���hZP\�@����9;;��������l��LP?v�����5���x���H�;���C�6#�J�Mve�k�#f�).�|+��v^����$�p��8�$��6�F�%�y�H����A�O��dZ|4��~�(��\���io�d���?���u�7�=R��_=J��k�J����������"��A�(7��xe���+�L�Kxo��0��
/��}M���D����J�ZJ�dEHi:�6*{����0��|��oA�F�v#�n	 V��k�T	5�]@����N:
����#=	�u~�?|wz��_�>
��;���.K�h+�on�oL�I{�o�X�T���v���tj�������0u�od������Y"��J\�:�[	_�$�����0<�EW�����
U��W�x���+�m5�,�g�?�D?��$??����y�{�_�*��$��x-90���?�i����|�)�
���f�t�	R%6�'���'����,(�~
��7_����{��C}G�\YZu�|2�S�Q�El���cC�^�om��WZ<��8�,H4�]�e^�fa���a���>�5���vr"���������gV=�����G�8�[�����5�D%������+E9�(����������)�Yd���U
V�&1z�����gfj�T���T~������v*�EuG�W;��E�c��a����^:0���!�fW�<Xhqq1�3^��?��-��0�l��H��L������������ �N�����@�/Gu�go�h���N:
q����T�d����+&9k�I�f%7pz�)�c�X�g	]���o-(����n��rs����j����Re�@�� l�1;\1���8G�Nh�-���Y+%8��^`���b���"@is�'�[��]V����+*��_��L�5%���~|���1X>�	��.b+}����wl������a&b
^:�e�����nU����';���R���!���m�A���$Q��U�X�$�s�N_��P�7�B�k@Rf�����!��^IZA1&�%J��G=�Xq�jX��|Jr�:^Ym;�T�\�W����%g�����3�$�����q�A�)�`���=<8���O�g*�C��bE�-��;�:�]�QJ�%��5���$��b����'��Xa��2��o�j?��2
���6�����?E�j��*��[1���k`sAvT�&0�\m�����h0���'hjdANy�_�L9��sy��d?�_B�u3I�"� �-�.���L����D^�����6��:T�B��me�/�3�.��"��&d�K�O={E���^�hA/(R���P+-??�z�~����N/��y1��M�H��s������m�M��Wt�P��))�pK�������i�+��L)�����?�I�]3���i:��75Oz��9O�24c�3����d$7�{x�����.���V�&�q���s{����\_���,{!/�F�l �\R����%X��l�O�H��#�1���!/�B�0��HT��d}g�����e4�&���j�W"�}�y���U��W{�g/�������gI�v^�3;��B���J�m�r��9�
{�8j�O������!�poz9�����lg���9�,I��zi��#��������L~�,]����$���O�z�0����/p	d��i�p�ff
����
�������t�$:���9��P����4�>�}�~����>�)�L��������M�_��z�T�<�p�S;[���gpH�g���1��m5 ��l��	����4�G��pJ���|BU0����7���A���M��N������B�"p�*%.@x ��Pz2wW���>yB������Z2���i��/:��NO:t�3?�b��D�;f���?'��K'g{�Da�s�Zj�����1l�����X���i���
��������u`	=��E��e/.�Tho��~�1@�����jhx%�d���]������u ��p�l�a{�h��Rk��77W���GR�2���[�\�u+A���a�T���&j����2.��$��
#39Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#38)
Re: WIP: Fast GiST index build

I found that the results of the previous tests with USNOA2 were not fully
correct. The tables used for tests with various parameters contained tuples
in different order. I assumed that the query
CREATE TABLE tab2 AS (SELECT * FROM tab1);
creates tab2 as an exact copy of tab1, i.e. tab2 contains tuples in the same
order as tab1. But actually, that isn't always so. Combined with the small
number of test cases used, this can cause significant error.
I'm going to use a more thought-out testing method. Probably, some
more precise index quality measure exists (even for R-tree).

------
With best regards,
Alexander Korotkov.

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#39)
Re: WIP: Fast GiST index build

Alexander Korotkov <aekorotkov@gmail.com> writes:

I found that the results of the previous tests with USNOA2 were not fully
correct. The tables used for tests with various parameters contained tuples
in different order. I assumed that the query
CREATE TABLE tab2 AS (SELECT * FROM tab1);
creates tab2 as an exact copy of tab1, i.e. tab2 contains tuples in the same
order as tab1. But actually, that isn't always so. Combined with the small
number of test cases used, this can cause significant error.

For test purposes, you could turn off synchronize_seqscans to prevent
that.
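
A minimal sketch of that suggestion, assuming the table names tab1 and tab2
from the quoted message. With synchronize_seqscans enabled, a sequential scan
of a large table may start in the middle of the table and wrap around, so a
query without ORDER BY can return rows in a different physical order. The
ctid spot check is only an illustration and is not taken from the thread.

```sql
-- Disable synchronized sequential scans for this session only, so the
-- scan feeding CREATE TABLE AS starts at the first block of tab1.
SET synchronize_seqscans = off;

-- Copy the table; tab1 and tab2 are the names used in the quoted message.
CREATE TABLE tab2 AS (SELECT * FROM tab1);

-- Optional spot check (an assumption, not part of the original advice):
-- compare the first few rows of each table as a sequential scan returns them.
SELECT ctid, * FROM tab1 LIMIT 5;
SELECT ctid, * FROM tab2 LIMIT 5;

-- Restore the default setting afterwards.
RESET synchronize_seqscans;
```

Note that SQL itself never guarantees row order without ORDER BY; this only
removes one known source of reordering for test purposes.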

regards, tom lane

#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#40)
Re: WIP: Fast GiST index build

On Fri, Jul 8, 2011 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

For test purposes, you could turn off synchronize_seqscans to prevent
that.

Thanks, it helps. I'm rerunning tests now.

------
With best regards,
Alexander Korotkov.

#42Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#41)
1 attachment(s)
Re: WIP: Fast GiST index build

A new version of the patch, with a little more refactoring and comments.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.6.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.6.0.patch.gzDownload
��Ngist_fast_build-0.6.0.patch�<iw�8��������BI���8���D�����I�����IH�3EhxXV���}�@R�����jG"�B�P�`7�M�v�����������y2I:��LT��e��i����}�j����7��������[��nx����+
����z�&���������Z�	Q�}q�'�3�p�R�>�����
�$����_s�����e">Q���Okb(�4���X�� o���"_>���������������������L��j"���*��y.rd���I�!�*�l�MA�*aP�H�he�M*E>�_H0���a�SO��Ma�f>�z��}��_�+Wqk������ZG �_\���=g�h?_��o/No����hp}
������#6��i>��>��2LR9������~�����W��^��'i�~M�>��7���X��#��y�>�a���]��g��"U�;����D�����f ��^/�5hP3@{�{7h��~��-`A
��~�4��S�F�����l�oF�'�����^����Xy`�����,m�!l���k����������{-����KU������WdJUI����}�5K�e`L��L�DN���N��$�NU�&����wg���M\y|0�����F��]O��2O�R�������I_z�Q�s3�x\���'`�FA4�D�K���
G*���9��0}��x�������F�:���=1s��
SSh�w�V�K����}Y;%��w2+?!�W�)m�����M�n�bgq;[8I������;[��>s��{`;Tq�3gw������X��i:�����f��
��$k'Y�&r�w�<��D2�v�ZZ&���"$d�oO�a{������4gZ�j~�u1^���:��T� ��<r'����4��%z.x��x$�c�f�xDq��G��F���f2VHQ���p����T��)HZ���X~yA*)��6�)pM�Y�1�3_y_��P���}�lm�����������z,�'�ok=�O�	p���C@�0gN"f�����a���b���s�ZH����>�����NQ�� ��p��;
��%p]>�(��������Z��������Y`I�BA$�s%2&i��������GQ���t�>k�p�����t�w�������~��[�2uD��c"���"�����A��2�p������d���I��!Y"\) .�y��-�L���6��Zp�q������
�BA�9��[D��^��t���� ���;	���&����O2Vmq����XU�(�Gq�=h�� ����:�|0I��X2��L�G��n����~F��0��e([i0�B���o�T��O��p�V��b2�NU3}�Z��?��	 ���}"@�3t_~�[��V~��v�3��i��{����-�rhHY^Bs+����e:��E�M����h��mX-��E B�R�h�6S�$)52�D�@�6^G)��e[��lj���;i��W'[����(��P�#@(]��O�=��B�P�D����(%@��R
�?���Y�_��z,+��v^���W�S
U�h��,R�U������
�U��6(����������ii	�'(��@@��mq$1��6�Z��%~��3vY��0��>x����H��#*9�p!b�����x�:	r��V���b�������M	)V�\�3w�
l�!�J�R�����
���h��71���d#�rvM �p���36�����=�H���h��o����I��hV���
��1���07�>��M ����;�J&�F�6������\UB.bv�6��E�5�����K���l;�iCXI�m^2����`Rc�I����
��4����O�F��B���B��[y��XM��3�����@�����_�h����>q�{�06H�)��iB��(�K$c�q�WJ�4=1{d8,��������#�z��t������=�e�k��$,����2�P�
n���:0C��Qe�]��� ���W��d<
��`�^l���:V��<Ga3�����<�v��T��
1��MAy��a����bPZ3+I�sn��EUX�Paq��h�e���HcTU�,�w<VBYF�2�z���S�|�@��`e��S��|W�<�~w�"����b,�O����&�F��~;-skYUr������N)X����5��0d��iQQ��`�W�p�������8��5H8/?�e2UQ�����P7y�P� ns�U�{�q��CB��x5kR(����vqDi1\����1����a��)�{+�k�:H����9fE��B��Pv�0D��=Cq�1&
4���K��B��oq�����jn3��8Om�����	�4���
�n\�Q���I^���h����@����5\K���������0)�km=`4�UK��g:���9��
t"����&�m;�`���XMU�s�V��H�a�S,��
���e.�C�X=��tFw�m����h���<7"���@me�����j���=�����;1��F�c�>�	�qv��&���5�M
�2s�J�����XEE�Iz���R���k�
��4e?=Q�K�yq���(�l@L��$��,��Z�s���Y,��RT�P�����
i��EL���C�m��\�0uBd���*��4��y>(��!_]��D�+A��m��R����I[��%�e{����H5s�wXf�����m�V�f 1��k�|��`�����R����n�q��+�T������"Pea{�����6�of�l-IU��!��I{�Vj����F�.�����4�?�������s����8�������]_�����_
��o���m"�c|*\���lB��i'��W����������3�F��?���O��E ��H��A
S�OE���2��`�;��� j�����?�[/�D�L`����j .�@:�����/��K^�[�]��gP3��c�d����'�������-��FCu��=Ia�x�.�<��@y����A��xT��/��^�z_���ur>B����hc`z���M��h>�a&��1�iI����SZ�nrB��79MZ��q��bc���"�#���
��'�J����IXaM�4����-\��46��&�)b�r[$�&�-L8�����s��Y,���4y�����z8Ldz��*�!����u��a��AV0����Q5:ES��+`�� �46���8;���o4V�o��s6�#��-y�����'����R�����=��� 4M������Cim 
�@�����h���16�m!^�I�Q���l�Cpwx�&4�������N���r���p����Dq�s��-L60� v��� 2E��T�)��X������C��'�����o��]��:t���N�on��cZ�y'�Z���>o.}ws��\���d��>O}��w"�G{
}��fU��Bw��}�d!HAL��x����>/g[�~��R���K!B��7,������Ku�l%^
��0���B4�Qf�YB[|�<��eRY��5A�L��w���o���C��0g�c����!�yH�4bx�
�������u�kL����H>��D�6�s�'���X�\^A��h�l��[��#)1C��FAj=k}=�{������{A��.���0R���+9����hj�	E� �0����!NN������w�{?^]s_�D��(	����Qpw���n/H����mY�j���#.�z����?0�y����k|M>����������F}���1��7MYt�c<=%b\S��Fwm�y��yc��h����*���t�|��!d6'tT
��� �W�@vE��~��:W$
�g�������	����$���;���
���/���u��/>:b�Pj����w{Ey�#k�u���VL��D���w��~A�)�����T1z�z��#�>����[�����������r��R�y�[�b�^�H5��S�����m.2v���0#���)�z&q�*��C��[6��	��T�!���V�(����jg\�������Q���v�8�0)��:�k����r�$7�.��)�#��|�Z�4iR���Z�U*��P���R�s����,bLM�b(K��iGqAf�%��K�N��.`MC���V�E�\�K�N�����[
M?�C�\{�b��e];#�q�0���nd���q[>���������#��TZ[�w"�f�6\�������B��nE��;]��6��%��Y��@d��jA8A�A�'6���0�����i����#���l���-�w��b$������������	�9^
Q��e��6�8*L�.���j�n�.)W!�2��=s{l�7�q�� (Jb��q�x �+�,�g���?�9W��d��hB����������Et"��� 
��U��6F��T��=S���P���l�V��B_x"4�&�J`JL��/}@���B����K5@7�����x����1�B���+��-3�c�����LT�S��f2���)��8�T2��P�)F�7q(w	��k@�����{���n�wyz6��7��V�U�bk�o_���e������X����]���e�]��5lTs�7�M���i�K��l��Zyz lX��"��v����c����!���[�Z���}}��]��([���c������l(����@����j���
[��@���ssXB�\�(�y!����N�c�E<9�@�i3'WD�����o�lo:[�{����y��o��L��lR�Q�/*�8L>���57I�	���|H�Q=�>����q�BO�/�oQ����"U��]7�����c���P�(9S����Tw]��V�L��!h�����?&n��m;�{����a����J!��s�w
��{A�?�tv�w��!��",�,���`a����5��f�-����R#�Y��l�X�&lU�>��0��9]�D'_����-�`���a"�,As"�sc!p#C�=_����0��l���^�Y�{c�Yg����W&���Y@�:D��,u�R���������4�FY�e������Exj4�u3���f������W|)��d6��Z�$)p _�?i�Z��T���`��T%������+��-���k��[��]���k���e,��S.d5)��o�������A��Pt�J��P�D7D�~{�_���t�7������O�h��8���#���%O�s5�$�%�7��MSY�hX��WF��l
7[�]����l6,��
�f�@�����TY��u���<������O��%��`���A�-���0|��oa`�y�;������� 
���X_�/N�t��$P����#�h���3�V�W[N����I	�g�f�|��c�J.��9m����
��G�P�re�s�h��s#�
�bH �x���v��&Ax	�����B��"����kzV
J�~���0��:}1��_���^H��B�7�W�����$�ZPQ�����$�:0-6=�������X�7�����:���J/�!BKz�dr96��g�z����f`
�s�.�<�_��*p��X��FY���yV{B�M6�)������j�*I���x���z����~��8&���|^���,�b�Z�y,C��h���P���E���w�j������?E���#���I�<c�p��p�yr��������QK&��������]�n	�y&��Z!�]]�]�v��o�R��%�(��Q_�s�7E������g[������+����&	�������7_���-��L�g]	�+�����nh,��Y�e�
����<�������5�;{�"��O���������n`��)�rU�.mfwTk���#3r����w��~�?zv����4�`�X
_�j�zn7��c�����[��kv7AU��xr�ji�.����?>:���E����F�z��$�ds�?kN�[����O�({?�=�Q��� �-;u����a?�7h���S��Z���%���7�?����^�����,Y}��o�k?���ck���i�~��k,0(�#�d����,>�h^����c�d8 4��q��Sh��\�y����a�0���9���Ad�$�~d^�9����C:������Cl@����a���K�i���?�I��6������j��"b��o[���
����u
��	�5�LZ�fUb2�m�`�9��g�^��@'�]�,�t�C�j�Z�?9����3����$m=��n��U}������+�bS� ��O���Z���l{��7�	��[/��'GG{h*?9|�c��0��������{:Z �����8�,����`��3C��m������B�f%�Y�����]�)��j��wcr&A1�.��^+�z���������F�{����+�C?�7��|����@�LQ$���a���\��}#
������������������2�e����u"K\���B����W��|�����;���dZ|5vP���u��K���z9���d��e�����_��2���rZ�VG	��q���!��al�j�8���������dH�jUH�j"����6���\�������<�= ��������^Zz���4�,g��lF�5�0n�����0*Y�V����o��
�j<�8���R��6b\����*w;�>��C���4�G/~!\�p�{�>���j@�DR���p�G�Zf����2;�n�>��[^-�i��CtE�,g��e� u�4KR�����a�wt,6=��QL�[��'���V\�v��t
���J�@Q��gV��#�h�IK�L�R��dU8���X�G���D|����Y{�pNG/���NM���
��u�����J����U)�kX��n��lU�M���px��Sovs���hM��k�2�8�����������w��.W�����/��U+�?���F�	��w,+B���l��.���_�J����������<�W9���H>B'u#��K��9!�A)����JV�Z�QG�l�����	N�}:A�{@������l{��P���#������2�xW~v?����H�?Y{Y�:���\���[�?	��N�>Y����~�]�G�fo~���]Q�r�G
��)sIj�3\���C�����!E����~�@�)��O��7j�Hz6�@��H����u1d���q�i�0i�;�vo��i{�.=b��o�'D
O�����e��V����K�a�!�@q(��������e����J�Y��&����w�|�C���d����
!;�H�l���Y��P��buU��Z^+dX�TL���Yhshgo��;�ON:~���)�Q{���F��w[�}���@���MT����u�P4��G�����Rsj���w�:���*��Z,�/J�����"����.# �J)��x�9Nc�""j�v�n�[��b�rZ
������O�w���&��)�U#3��qS�������?V�A7d{4���6�m����^��,}�H�@�����������5�A�x����|��*
;���������KG�5���
�b��K��u��4�P������5���������|<t���^J�8Z�>�{2;�����Z��c���HF���B��
�.y�{L��d*�H�"��Ee��K�Zu����4�iI�9
��v�p��=���#�W�m�EkG^���:q���w�����X'II�w��l�Z8����M}�����NEa��>E��D�����"k���b���������>o���]]O	C��cC���n�"
c]Lz���s�k���q2�a���S��?
��	Nr|=������]��eV�
�(�����0Sa����tP���8r��g��
���[A*�@����]K_����"�������&������0a����U�>��~�(���L^�	�$��Cj������;���,C3^>�������>��r3(�_�������#���^G�v��d|lIg���"����lM6u[�tVD�Zb�	�W�,���n|� y�^���W������?4a��.eVp�ySKA�[�k�$�H�����C\����R�
&�K�s}t��������(>�Q���D9r�Wr���U�Kn�F{s+�:�����.^-�Z��CQ�X�']�����g
C��$���,kx50�{���A�.,/�������sd�a9X������bH�S�2Y�v�O(���,��OByx,�� ��KUB������;UiY�x���(U'R�:j��� %8�z��Y�(qXE��U���w���6�Cp����+����R��C@3_.U
��k�;AT�n"�KR9�������	�u��a�%���^4�i�������_�jG����Kz�G#�����W����
��r@�P�K�2������Re(������S�*Yf�R�M*����'?x�d_Bk�>�@��Ilj+P�������i��9����jg�����>�[faW����j���j�Lb�]��v�r_S��4-.3��-��E�zK3���L�N�j'
�[%�� �\V�+ �v!Z�r>�N���v�Fe�qX���1?����\e)5���>��4u��o�@�B�2{�\�bpR�����`bK1D�})n��;�2]���X����Q�QDZ����Y/����U^>��}��J��_+E ;�V.����n���oK�>��=|�!|U��VQ��i�U���,$�A����	6�{uc$�����=�XA�Nh��Z�S����*(�371f�m�v����o��Y��y���UMi����S6\b�8K�������f��h�E���[i�W%�����m�]�P���r�h\����>�=��0��
�kW�^ ����{����k��3e+��"�'���Wl��\����3HDP�%$��zDj��c���60�|b2S(b�5���|&��b7�w�z(�����y����e"02��0�	EV!<����W���U��&>���r�+^����3�������	�9t�q��2#q������x�u���L���K�1�M��k���?���\B���	����9R�*��0�\
�����wx#Q�(���!$�B�[�}'��m=���
��P��7�TE���G��#���5��{)y����9�R(�����M��p���Lqx�6�����?���%k@P�_�I���a3yj��0|��J�wU6k*c���S����{YB�K>�1���VC$���I�������2�?�E	g�U\��$gQ5*~�EJ�����������VX��8��U�����P�Y��*JO��OY�d�����*�
�#g�Q.&��&�8��gd��������M;7�fy��,e���l:s#)���f������I8��E��=�$�6�bbP7�����_��f;'���i
OhS	E�����0���)B���nF`q���8�p��o��#�� u�C����"eB�y%�bj������f��Y�X"��s����e�4s���\��3��Q|%Zt�$����,�\�/YP�7�WuK/���T+��mz���f�Y�'.N(��)�4���}�\0r�����5F�=�\������V���.%D�/��Rh@AQ�j��B���IV�d
M����FXp_�0,��cY7#uA=v��1����{�p�N�K�a�k��sR���m� �5�8����g.���VyaJ�6�7��\��g�_\)�I���Y+���F�d�x#�������8-����er K�=)���� �E
�F����&�\��F�����6II�?��OK-;���1�CR�jM���;���T���[s�������y.<�2HKa��>d��)D�-B���\�Os����!+���=Z�n�kU������o������}�!��=1���<	������dD���/�C�sw>�$�����3�9��������-������}1����C�n8�e��z�M&T�	t_�t'u��83�6k�K���g��}	3�Y�u�)D%���t}�|�^��|�F�����=�U��������chn��X#�y>���3M8$��f|w�	�W0��}��+� �����O4(�I�����B,y��������l
-����C]1�F���mS�,-*d�p�7�'oN�e��M��N/F��������[s��A{���|i��ai��(�*�K�(�W������� jLaH.J�����r���/1)�����6u�R7+�)q���l�����b~�yy����sZ6�Z�x)	�S^�w��{������z������jSJ��z�/���'>������x7o�d�*"��|�������������y@X��	�9�-#.���=v5a�:�M������@U���a}��=��d��y�8~�N����W8,'�w�x(+�xyd�(��R<]_Z��)�t�1�n�_P�=�c����8o$k�RAsH��/wzd��	���oFJY�(��[@��
���������3kj�=~\������}$�o�T�IG\���&���1AN�����9�&��aj\��xE���+B-Q�Y�r����ejH)�SQ�Y���l���m��SqZ6��-z319�p�{�gki2��^��
e\���C���/���h�|�)l�������?
K���;�p���!���je�_/9��_��X��]�����E5�,��������>��mq���]

����w����"�/�]m��m���*?z�m)���	/�XL��s_�kg2�s	�&%�>N,/�:%���n����GV�hxAb�K�n�mJ��"^HIX�jF*�N#F;���U�.^�����?�X�ZEK��+�p�\��>5}�w|���A�I��z�����,{��Q'l�������)#y7��M��y�s���4��G�c� `��4�����$��M���]U]�xi�L��� ����2&�9^(h�
;��B��E�T�����A�	� ������&���Bk��V���ug--��	E�!�9���]��� `����G��:/��~}pz���'<|�?3y�r�*|��P2Zt;�L;��6��7��Q�m��j����b��h���3@<�C]�k���n�zq���_��Zi�(,�?�c� v1I�~�'R1�U��pp	Ld�����_,�Y�k������Bzj���I�1��	$a:>�~�9�O�9���y���dI�b������5JCOY��������[�F���.��'MnL�E��X�������E�	��@dC��K����xR�%�0�meZa�1��50�M�2��0�~���/h�����W!�p�:�g���N�N~i����Ea^^��y>N7�?S��^���a��=�����i����.�S�O��qm��X-F��q�VWADUU0����R�^qi=��uz��c|�M�A|���K��z����B}��B\.s�a��z����1���pLtT����H�"��e=�E����UJA��!��>o�fRv���(��"�����(�!�Q],Tg�P
�J���u�2�a�L�'u���T� Z��\^��)����#��|*2r��h�������C���QQ%������LiF���|.;�1�9~9/oa$p�������}�&��l��
�U�*�5��r?<�7����W
�]W�S4�>&"�]}��CM�1�,�N]�|�����`��~��f<n��44����"��	HvM����7	F�f����<iC���:��?s`)������"��xp�g������Z#��&.��`i����
!���H���?�b0���*��HE�����b���u���`M��=��S�`�J��V�j�;�����!|
!:ssE���Z4����9P�88\�]�����)�M������A=X9��u�]�"�${�9ixI�z_u�eh�1��P���H����yB�u���0Zb��^:�B���K�F����	���=���c���<)c�=�?��.�hx�{x� >�L�wW�G�a��d�l�w�y�2=Pb*D�?/�m�9��g�Y�mS���
+a�>��������%(������[a*�H,v���)S���3s�]����B|����~�a>V����.��u��M�^�������t��^�%��D�z���d�x}xt�zo���4��Z�-r$���������5�kvqJ���Z�����I�_��F��dv�������/�����x\`�_�o��X<1+y/��
�"%��j�M2����!-���N
�]�����R*��qut"����dZ!�@�@�t���(Kd���%��<���������^��7��ux��T,���+q�V�SU��������7aC�^�����l��"���2�0F(�/Q��L���^,���������s9%k�5�C�4x��������~j���v
4j��M�M�,���(+_ �����p\�(6�Y&����pn<�g?���6X��c�L�3}j)&}��q�����QTv�}���}�YnL����f�{VA�2�(���z�+�"��$V����f�~�s\��	_ �c+�L�\@bC+.A��
���!%��(!�YJ�2I��5���?���z�� ��+��L�b@����T�~���c�.�q]��c��f�v���fW�>�e�#�`��j�.�.�]�+"����m'�=z@��8� ]|���fW�f�]������6����1��%%z�9B�������]Y����P&�:��8gq����D���y�.�k��L����~��M��:q����`u�n0,�����#��/��m���v����O�O�s3�(�K�/ZC~/�e�v�4�a���}���t�����=zr�z��H�E�B�$�]V���_H�2�����T������3V� �4���{�L�q�d��U7����~��y�N��Y~)��@&cX�<he��7��t���!djh���~Ct��#6�#kE���x��80Tt��m�9�n����,[������N3;Z�N��[��c^����0����?-}���+�#k[�u-t��r�������f�C�9C��j�J8��]�P�/+�}K(<���������n)�,���@��:	����i;��4�b�#��M�����K+Wb:v�YJ�l"��9�vO�"�!�*<�#w�;����B{'��|�K���[Q"N���7��wb�mI�g`J/�-��7��R����Q�!6 �'�����������w2:3Y?���y�7M�9��m%Z��D>�#�+C�����z��������8\?�0�*���Kw��g�"�y�}*���8���w�mh��.[w�?���������6,��	�V�]���z	6��IxKL���^j`�S.g�E�:W�� <��@Du�A��mmn?5�QF[��������+�f�g-�h�mU���=]�TA�7�ti�M����
T/6�6
��6B,�F;���f!���F*���E� 4��V�Qh>�c�6w�GG�Nv���	C���_�?�v�f��)�s?�3�l����� �s�HA�9�G�s^I�a.����N8���D!�d���pQ@$������\B�k%���d�~*%�U���r�'�,��Y���+Q-�Hd�*�>�n4���=���K2�����j����������^������z%���'����	�Q��C��RJ�F�����V�p��nCt)t�C4y�/��R��W�
�^���*�!l��gd������y�����}|]����l�������
��������G�����n65�m�b9�c<��@�Rk�ZQX�����lTx��g>�M�^��8,dW�u����c�x�9�g�:L����������y|����?=�;?��:���=�*�X��|vtv���T��������
���W���U���p�z<`��������j�� ���G:�����3^.h1�q��c�O�=�)Z}�.�
y�2l�g�r��
�=I(�������g��c�V�5�U7���[�<��n�����tz���~+��j�4]e�;Z�s*8�y	[7�����/�V�N���J�bJ�y�5�M��h������C���F�v�$eeKk>h",]~�>H'���G2��D�1���<����f�9��'��o���.��Z+K���z�u�
��>6�/4������*�{��=b�L������@Q�w�2G��qu���
�|�'�#?��/vA^[����g7T���R����x�F�M~R��Q���Yn��(�������R��p��:�>��mg�T��7��������k��|X�Dz�0��(
`	��z���������I�~��WSX��}]��o���T`bo,;v��"#�z�
Q�#.��?����6�n��E��`�L��9��m���0"D���sf�s#�|����F-��x��>d��g0���z������G�u}�YlV�����-�W+��3";�7A�VF�O�?P���F%[����7h�-gE�T3�����S��R(��������0���R|}�����B)iU�}��szp���������e��������.��M�[�,��s�B���>UF ���EQ{���[1,��3B��-{����
)�F��I�������/+z|���W�����((�����_3Ieb���Ll�K�,��I���]&� :��� ^L��+P?�J�T�I���Ee&�C���?�O�r5����~Ohty��j�6�<H������4EO|�����_y���A����'Zz?u�� ��,#P}}i�'uQ�n�t�
vG����M��a��'�z$g-��`ee%����6������Zv�QN{p�4��
r���Sk������*�?�r3�_�����7�����kx�}�����g�|����#�})����S�yrmPC������8y�4{�l���o��7��3W��i~����7����t��O��7��6YT���.1����������a��~wX�n��������������{�����q)mu_w�`����������&4C���#
�+�b>�^�����yg2->�$�ug���r�<=*^N.;���
A��"E��� ��0Su}9�z�����ss.7��k�����{�_��N?�
��m��X�[7i*|hd���K?�G���7�����w{��;e��5W���9k�l�s�)zg��j��ld�#k�f��>o��$v�$Ra�l�S�������w���{W���]�c �I�B���bM9�%ic�RV��
t�Q����_<�����m����,7�pp�
�K�����7�l�����0����~B����@�uT�%HC�������qs�\��^�q�O8x�?�l����y^k�kN����6�be����:��*JkA���>���w.y/	k�k��*��P�~�� P)���3n�f���\��:}H���O���)g^��x�������:�������0:%��4����yh� ���50���l�nVb��'[��$���
�ah�d&C5��$)���P�F4��
r���.�J��=�|�� o,O�P��1Z�t8�v��q�l'_��w�������P��2xq�sh������c�@�T���X0�\�
�-�I��w�T�u|�~$L��dn�-y�y���0�����04��<�x�5�J��_ �BW���b5��8pm�3������w�8���W������i���?�L���]��1�-��.��]���4(��K��^�r7��0��l�����^%�e�u���y����y���+������E��g��4%�[��L�
��������z.�W�{�����AB7(#(��X�s1`Fp�O��	�����qs��+���K�A���fm�@!�d�&P�8b��/��p��a������F8{
N0�0�
@���5��,	~���P�%����������f��j�������+�"D����k�;�i�����F�O5n
Btl}�SYN���q�g%.�b�'�f��=�Tp���u�Z�$M�H�����9Z�����!n���v��d9���F��C<8���i����/""�:p��a���S���������Y���l$I�Da��P�u�����{�0����*'��e��� ���N���8��j�t���/��
��5���oBj�3h�zE�j����v�e��r�j���`5P�/K>H4Z�l�-M�-����K�����{a�9=/��n���V_��62O
��{�
����;�`��������n�����T�)��K��ea.�'���[�����i�45jTl���_������.�)6t~�>�iJ9!Y�q�%(�^��N������� �|���MT����������W�Vc�K�����n�A���S��btD�3z�~?��p\�����Q����I�����9�8���g�w�]�U���VZ�vO����'�,.6Zy"�*�K��U��a�/��/�PS9%M����Cd�����a6i�!e��3"��P�#�0t9} �.7�|���ig�����W�5P�+v��
�cn��JV������3�0~�jp-���*G�R��C��,�J�E�kS�VLl�~f�^vtp~p�M!<���7����Z<���-d������M#"�����T�_gS�-2F�Gk\�8��3����sU7�ra$�[�	��9������B�yQ~�+��,�����@2m�$ZE$��S�F��1l�j�Cx'�c$�����ob��.������Hz��4I�T�V"A��3	wT��=������������������GV��y0M�3��BE�r��)J�$�T����1�����|K�"]����}d
�����VS�5��qY��������ER��uf_O��~T�D��Tl�[��k�����E���x:����<#��.<>����S���"�aV{��jD.��#%�'g��9@���\m�8?\���KW�G,v�e,@c}XK�u���� ����c���TN�U��MY��=���i�$����1���e�����O7t�����c���z����������!��
������Q��E�/qA�������Jg��l��em��Zf��Q�[#E
r�}�C���Z�?�S�n���x���������*���`�nm���8UB����u`cA����%��K�:��*M���;�Ua���:����]p��z�H�?F,�@1p7�c>d�
�C�rGj���mjf��D�RL����f�s�He��GRn�0$~E��+6C%���-N�k�{��:��nU�u�.``ye�t��O��������V�� @�:?[Al���dX08�.-`����z�0�pa����@��E]���H��UW��5�U�a�BB�(����=*)!������Q�7h�58z�b��s�N��U�����-\M4b������C#B�Fh���Q�3l�\~�[
��YHN���������E�K�!�	��H9���<�U�kI��"#���?�)FE:]��c,��O0�Fb>�����R����L{�h\�;�+�J,[�;����(��i�re�ywT2>����������)Q�)Ycv&�N"��px�1VO���S��S�o��6��n{*]A���7�K��>��<���2"�����Q��QQ*I�������Q#���k��������)�,�z�e������a�!rV�\Dd�o��xvJ��2����j6���w��u�M���U�c�������I!���|��EcqI0���p"���~���T�<iG���9�������SeJ�)��_���7�o�Dh�k�bY���H{m-*��@������@+�s��ESJ:��3DK�`E|'"����M�7��p"��X'G�{�����"���XA	���>��������_�\�����#t	���Iw�W�m[����v�8��CC�ts�3#l�:V@
�3���V��b����bp�\Y��<���+0xs{�aV�R�F<\)~�:�%t�J��4���O:li�����J����Fp���'�C�I�#��*r:�%��Z��n��ZSz�@m,���5j��"�Q'|1�������
[Q-D	�����)��p�����Sl!��I������8E%��sPMR2:N(�y')3���	KH�#�-���-�&!�7L�QxnOn�f*$F�0N����M+�>���A/��mQ�T���A����7pELxVO)�f��5����1��TF�
�0u�G�2��L��GW53��7�vIa�\G�
-=�im�q)���������Wp���*�,�4��G~+DzH�u�2����j5�0&|=����f�r*P7{Ra2��}]=��"�
���1�@�W�Sgy��A*8p�%��4��@2� �g��^�]��G�%i($��J�SY�$c h ����l�n�U!�<��0u\�5��!�&��*�J_�0e7� ���l��$�����`���X��0��������������o_�r��20d_p����V�"�$���_D�|�^Bm�F���p�;��!U���"����|k��}�[��@��+G�w�CG���e����%��qh�
@�R�mx��#�YN�#�&p�p-�/1o���g��sL�g$W}�R�����^����M��������YM)������-�#��I�k��:��k��J�e������H���[���+�$�@k��S5%���_XVcY&�%*
��2��]5kBE�J�����\�^Z����rO�������EN#R#�z���1����.��@d[��A&��Q���D����w)q��9�e�@'	R{�+��x:k���o��?��'q/+Z('��~j]{�6�1'l![���]���.����s^��'��d{�U�FK��t��-!���6mF��>�MU�����Alf��~O�nrX7&?QS�j��f�zm���F�����������U4XU�������y(������
�h bJ&��T��xt������LX]��z�T��e�	K��X�����L�]����T�M��q��t �l���P�M�|�-6���iG�Z����8�7U����X��5����ew?�(������i,��S1r�%���p���&�����?� �����]A�^Z�R��F�| ������-w���������o<�X����h�}�b���Bz� ���2��kQ,��������f�o!��T�]�xj�la��;]��G������l<�,*'.PpPW_���[�/����sU�%����C�I^��(����g���xO+�� Jb6��,`�4�wij���Q�#���o2�����*`#��a�G$������C`����`��f)�-���\w�1tI�,�F�A�s@�s���R�
#43Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#42)
Re: WIP: Fast GiST index build

Hi,

Looking at the performance test results again on the wiki page
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results),
the patch can be summarized like this: it reduces the number of disk
accesses, at the cost of some extra CPU work.

Is it possible to switch to the new buffering method in the middle of an
index build? We could use the plain insertion method until the index
grows to a certain size (effective_cache_size?), and switch to the
buffering method after that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#44Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#43)
Re: WIP: Fast GiST index build

Hi!

On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Is it possible to switch to the new buffering method in the middle of an
index build? We could use the plain insertion method until the index grows
to a certain size (effective_cache_size?), and switch to the buffering
method after that.

Yes, it seems to be possible. It would also be great to somehow detect the
case of ordered data, where the regular index build performs well.

------
With best regards,
Alexander Korotkov.

#45Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#44)
Re: WIP: Fast GiST index build

On Wed, Jul 13, 2011 at 12:40 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Is it possible to switch to the new buffering method in the middle of an
index build? We could use the plain insertion method until the index grows
to a certain size (effective_cache_size?), and switch to the buffering
method after that.

Yes, it seems to be possible.

It also makes it possible to estimate varlena sizes from real data before
the buffering method starts being used.

It would also be great to somehow detect the case of ordered data, where the
regular index build performs well.

I think this case can be detected when most index tuples are being inserted
into a few recently used leaf pages.
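
A minimal sketch of that heuristic, in Python rather than the patch's C. The
class name, the window size and the 0.9 threshold are all assumptions made up
for illustration; nothing here is taken from the patch itself.

```python
from collections import OrderedDict


class RecentLeafTracker:
    """Tracks recently used leaf pages and how often inserts land on them."""

    def __init__(self, window=8):
        self.window = window          # number of recent leaf pages remembered (assumed)
        self.recent = OrderedDict()   # leaf block number -> None, oldest first
        self.hits = 0                 # inserts that hit a remembered leaf page
        self.total = 0                # all inserts seen so far

    def record_insert(self, leaf_blkno):
        self.total += 1
        if leaf_blkno in self.recent:
            self.hits += 1
            self.recent.move_to_end(leaf_blkno)   # mark as most recently used
        else:
            self.recent[leaf_blkno] = None
            if len(self.recent) > self.window:
                self.recent.popitem(last=False)   # forget the least recently used page

    def looks_ordered(self, threshold=0.9):
        """True when most inserts went to a recently used leaf page."""
        return self.total > 0 and self.hits / self.total >= threshold
```

With ordered input, consecutive tuples keep landing on the same few leaves, so
the hit ratio stays close to 1; with scattered input it drops towards 0.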

------
With best regards,
Alexander Korotkov.

#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#42)
Re: WIP: Fast GiST index build

On 12.07.2011 11:34, Alexander Korotkov wrote:

New version of patch with a little more refactoring and comments.

Great! The README helps tremendously to understand this, thanks for that.

One thing that caught my eye is that when you empty a buffer, you load
the entire subtree below that buffer, down to the next buffered or leaf
level, into memory. Every page in that subtree is kept pinned. That is a
problem; in the general case, the buffer manager can only hold a modest
number of pages pinned at a time. Consider that the minimum value for
shared_buffers is just 16. That's unrealistically low for any real
system, but the default is only 32MB, which equals just 4096 buffers.
A subtree could easily be larger than that.

I don't think you're benefiting at all from the buffering that BufFile
does for you, since you're reading/writing a full block at a time
anyway. You might as well use the file API in fd.c directly, ie.
OpenTemporaryFile/FileRead/FileWrite.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#47Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#46)
Re: WIP: Fast GiST index build

On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

One thing that caught my eye is that when you empty a buffer, you load the
entire subtree below that buffer, down to the next buffered or leaf level,
into memory. Every page in that subtree is kept pinned. That is a problem;
in the general case, the buffer manager can only hold a modest number of
pages pinned at a time. Consider that the minimum value for shared_buffers
is just 16. That's unrealistically low for any real system, but the default
is only 32MB, which equals just 4096 buffers. A subtree could easily be
larger than that.

With level step = 1 we need only 2 levels in the subtree. With the minimum
index tuple size (12 bytes), each page can hold at most 675 tuples. Thus I
think the default shared_buffers is enough for level step = 1. I believe it's
enough to add a check that we have sufficient shared_buffers, isn't it?

I don't think you're benefiting at all from the buffering that BufFile does
for you, since you're reading/writing a full block at a time anyway. You
might as well use the file API in fd.c directly, ie.
OpenTemporaryFile/FileRead/FileWrite.

BufFile distributes temporary data across several files. AFAICS postgres
avoids working with files larger than 1GB. The size of the tree buffers can
easily be greater. Without BufFile I would need to maintain the set of files
manually.
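
To make the bookkeeping concrete, here is a minimal Python sketch of mapping a
logical block number onto a set of files capped at 1GB each. The 8192-byte
block size is the PostgreSQL default; the function name and the flat
(segment, offset) layout are assumptions for illustration, not BufFile's
actual implementation.

```python
BLCKSZ = 8192                    # default PostgreSQL block size in bytes
MAX_SEGMENT_BYTES = 1024 ** 3    # 1GB cap per physical file, as discussed above
BLOCKS_PER_SEGMENT = MAX_SEGMENT_BYTES // BLCKSZ


def locate_block(blkno):
    """Map a logical block number to (segment file index, byte offset in it)."""
    segment = blkno // BLOCKS_PER_SEGMENT
    offset = (blkno % BLOCKS_PER_SEGMENT) * BLCKSZ
    return segment, offset
```

Every read or write of a buffer block would have to go through a mapping like
this, plus opening and tracking the segment files, if the set of files were
maintained by hand.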

------
With best regards,
Alexander Korotkov.

#48Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#47)
Re: WIP: Fast GiST index build

On 14.07.2011 11:33, Alexander Korotkov wrote:

On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

One thing that caught my eye is that when you empty a buffer, you load the
entire subtree below that buffer, down to the next buffered or leaf level,
into memory. Every page in that subtree is kept pinned. That is a problem;
in the general case, the buffer manager can only hold a modest number of
pages pinned at a time. Consider that the minimum value for shared_buffers
is just 16. That's unrealistically low for any real system, but the default
is only 32MB, which equals just 4096 buffers. A subtree could easily be
larger than that.

With level step = 1 we need only 2 levels in the subtree. With the minimum
index tuple size (12 bytes), each page can hold at most 675 tuples. Thus I
think the default shared_buffers is enough for level step = 1.

Hundreds of buffer pins is still a lot. And with level_step=2, the
number of pins required explodes to 675^2 = 455625.
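
The arithmetic can be checked with a short Python sketch. The fanout of 675
and the 32MB default come from the messages above; the helper names are made
up for illustration, and the page count only covers the bottom level of the
subtree, which is the figure quoted here.

```python
BLCKSZ = 8192  # default PostgreSQL page size in bytes


def shared_buffer_count(shared_buffers_bytes):
    """Number of buffer pages available for a given shared_buffers setting."""
    return shared_buffers_bytes // BLCKSZ


def bottom_level_pages(fanout, level_step):
    """Worst-case number of pages level_step levels below a subtree root."""
    return fanout ** level_step


default_buffers = shared_buffer_count(32 * 1024 * 1024)
pins_step_1 = bottom_level_pages(675, 1)
pins_step_2 = bottom_level_pages(675, 2)
```

Here `default_buffers` is 4096, `pins_step_1` is 675 and `pins_step_2` is
455625, so level_step=2 needs over a hundred times more pins than the default
shared_buffers can provide.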

Pinning a buffer that's already in the shared buffer cache is cheap, I
doubt you're gaining much by keeping the private hash table in front of
the buffer cache. Also, it's possible that not all of the subtree is
actually required during the emptying, so in the worst case pre-loading
them is counter-productive.

I believe it's enough
to add check we have sufficient shared_buffers, isn't it?

Well, what do you do if you deem that shared_buffers is too small? Fall
back to the old method? Also, shared_buffers is shared by all backends,
so you can't assume that you get to use all of it for the index build.
You'd need a wide safety margin.

I don't think you're benefiting at all from the buffering that BufFile does
for you, since you're reading/writing a full block at a time anyway. You
might as well use the file API in fd.c directly, ie.
OpenTemporaryFile/FileRead/FileWrite.

BufFile distributes temporary data across several files. AFAICS postgres
avoids working with files larger than 1GB. The size of the tree buffers can
easily be greater. Without BufFile I would need to maintain the set of files
manually.

Ah, I see. Makes sense.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#49Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#48)
Re: WIP: Fast GiST index build

On Thu, Jul 14, 2011 at 12:42 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Pinning a buffer that's already in the shared buffer cache is cheap, I
doubt you're gaining much by keeping the private hash table in front of the
buffer cache.

Yes, I see. Pinning a lot of buffers doesn't give significant benefits but
produces a lot of problems. Also, removing the hash table can simplify the
code.

Also, it's possible that not all of the subtree is actually required during
the emptying, so in the worst case pre-loading them is counter-productive.

What do you think about pre-fetching all of the subtree? It requires actually
loading level_step - 1 levels of it. In some cases it can still be
counter-productive, but it is probably productive on average?

Well, what do you do if you deem that shared_buffers is too small? Fall
back to the old method? Also, shared_buffers is shared by all backends, so
you can't assume that you get to use all of it for the index build. You'd
need a wide safety margin.

I intended to check whether there are enough shared_buffers before switching
to the buffering method. But concurrent backends make this method unsafe.

There are other difficulties with concurrent backends: it would be nice to
estimate the usage of the effective cache by other backends before switching
to the buffering method. If we don't take care of that, we may fail to switch
to the buffering method in cases where it would give a significant benefit.

------
With best regards,
Alexander Korotkov.

#50Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#49)
Re: WIP: Fast GiST index build

Do you think using "rightlink" as a pointer to the parent page is possible
during index build? It would simplify the code significantly, because there
would be no more need to maintain in-memory structures for remembering parents.

------
With best regards,
Alexander Korotkov.

#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#50)
Re: WIP: Fast GiST index build

On 14.07.2011 23:41, Alexander Korotkov wrote:

Do you think using "rightlink" as a pointer to the parent page is possible
during index build? It would simplify the code significantly, because there
would be no more need to maintain in-memory structures for remembering parents.

I guess, but where do you store the rightlink, then? Would you need a
final pass through the index to fix all the rightlinks?

I think you could use the NSN field. It's used to detect concurrent page
splits, but those can't happen during index build, so you don't need
that field during index build. You just have to make it look like an
otherwise illegal NSN, so that it won't be mistaken for a real NSN after
the index is built. Maybe add a new flag to mean that the NSN is
actually invalid.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#52Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#51)
Re: WIP: Fast GiST index build

On Fri, Jul 15, 2011 at 12:53 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 14.07.2011 23:41, Alexander Korotkov wrote:

Do you think using "rightlink" as a pointer to the parent page is possible
during index build? It would simplify the code significantly, because there
would be no more need to maintain in-memory structures for remembering parents.

I guess, but where do you store the rightlink, then? Would you need a final
pass through the index to fix all the rightlinks?

I think you could use the NSN field. It's used to detect concurrent page
splits, but those can't happen during index build, so you don't need that
field during index build. You just have to make it look like an otherwise
illegal NSN, so that it won't be mistaken for a real NSN after the index is
built. Maybe add a new flag to mean that the NSN is actually invalid.

Thank you for the advice. But I didn't take into account that in this case I
would need to update the parent link in many pages (which might not be in
cache) on a split. It seems that I still need to maintain some in-memory
structures for finding parents.
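
The trade-off can be illustrated with a toy Python model. The class and method
names are hypothetical and this is not the patch's data structure; it only
shows that with an in-memory map a split touches dictionary entries, whereas
on-page parent links would mean reading and rewriting every moved child page.

```python
class ParentMap:
    """In-memory child -> parent map kept only for the duration of the build."""

    def __init__(self):
        self.parent_of = {}   # child block number -> parent block number

    def set_parent(self, child, parent):
        self.parent_of[child] = parent

    def on_split(self, old_parent, new_parent, moved_children):
        """Re-point children that moved to new_parent; return how many changed."""
        updated = 0
        for child in moved_children:
            if self.parent_of.get(child) == old_parent:
                self.parent_of[child] = new_parent   # no page I/O needed
                updated += 1
        return updated
```

The value returned by `on_split` is the number of child pages that would each
need a read and a write if the parent link lived on the page itself.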

------
With best regards,
Alexander Korotkov.

#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#52)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

A new version of the patch is attached. It has the following changes:
1) Since the proposed technique is not always a "fast" build, it was renamed
everywhere in the patch to "buffering" build.
2) The "buffering" parameter now has 3 possible values: "yes", "no" and
"auto". "auto" means automatic switching from the regular index build to the
buffering one. Currently it just switches when the index size exceeds
maintenance_work_mem.
3) Holding many buffers pinned is avoided.
4) Rebased against head.
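
The rule in item 2 can be sketched in a few lines of Python. The function
name and signature are assumptions for illustration only; the three values
and the maintenance_work_mem threshold are the ones described above.

```python
VALID_MODES = ("yes", "no", "auto")


def use_buffering(mode, index_size_bytes, maintenance_work_mem_bytes):
    """Decide whether the build should currently use the buffering method."""
    if mode not in VALID_MODES:
        raise ValueError("buffering must be one of: yes, no, auto")
    if mode == "yes":
        return True
    if mode == "no":
        return False
    # "auto": start with the regular build, switch once the index outgrows memory
    return index_size_bytes > maintenance_work_mem_bytes
```

In "auto" mode this check would be repeated as the index grows, so the build
starts with regular inserts and switches once the threshold is crossed.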

TODO:
1) Take care of ordered datasets in automatic switching.
2) Take care of concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.7.0.patch.gz (application/x-gzip)
�IH��������1�r��i#%����,1rZ2����P��S�3��o"��=�T:��U#�q,��RU�6{X=k����|UW)ejC�2���;����D�1�N�d��xy%� ����2t�1��~�I��1D8�@*�BF����^�N�z_R�Y�Z|�s��qt�`��<6��6���e�����KL0���V.����_:0��IKJv6L��\6];�S���A^`�k��g5N���1����<5���jQ%�y9���c>J�d^�5e���8�A�V��V�[�H�m��9u�s�M�?b��������p%�/;�
,�Kx����BH���� sL�n8 [�#�q��#���G������$A��
6L`���^v)	`}Y]�<�L1	(q��jr��JB����H��^<����i�$C�����p��YBO��>I��������0PH�U�/6� �Y�#S�r:�%``����n����b����g�������:��a��>?���T���B�tOY:1���
GNL?E����I_m�,T<����$%��d'�"=�����iX��6-���_����0y}�����m�BR�4����iE�'T�;�����-�b�����J�j3q�
LQ\qv��Mv�(��]F�
�&u�\�e
����r��j�Aw����&L+u����� �N�����nl�Q��J���S�UDEO�e�m��z����1���l���\����8����T�n��A����-�<����#z�y�R��)bWZxV�Og>�f"w�}��_��������1����@���E�xK���V�2�I�@���-c;��A�w�d�?���q\�	�3!� �kf�$�C���E�6C�"BR��Jh�'���H _+���F�`��UmE��]\������i&��e/H}zn�+C�;S /�N��PW����)�9��~H�?�YJ�jL��� ����x�
q.�w�����'�+�9��8����x�47�p�	:+����fX���Fi$��-�0�hG�����5�w����7�8�^�F�*�d5�e��T4��$C&�o���8�Q��?���~s���xO��	M��������@�j)�
aY
��|�h(�p��vy��Z�4�tY�L\��P�T�����|4���mD:�F��;�:�[�E�����M����3+�3#��#�]J\�v��DY�>P��B�����Og�7����v�#�$�e%D�-�C�o��
��	�P,�������.����w^��w����L%����9��gh	�Y�i�>��hr�n��w��q��=�N��^����D5�-B������J�:>m����6���Q�0�Nj&Td0��@�:�L��������*K�UO)��9��Y������3���%ig��e�z�����K�0��L�T/��KB���2J}S(_��-/�y�S�qO{XJ
������VZE���9����t?�(���V��4����8>���q	�qe�+�Ph����I$�cR����V/���|�5�yw�z�T~����Q�_,�//m�b�_g%o�������o�Az� �('HE1F$�YXx�l���j�Xq�R�wq������5w �T��0h9�&y�'������t^����� �qB�2{~���$E�m����G|��5P �`/�9�����-VB�E^tD��_O�E������5`/�b�t��ex��ul;�C4����2[���w6)�P�=a�&��}.&w��:J��u��sI�h��~>e��L����wR��G��ZU��2���f�\�F4����-��8�V�����c�V�X��0;�~Z���v���\��u�}��o6�	V|!V��A`h�K���49��[��l�(���hBk������p����e)zs5T�{[o��x%���1~A����I�,��M��v�4����+�Dkp�[ZM���t�2
A����#�����9�n���4��6��.�]_l�b�[�2���s�0��	|����}H^"�Y��*4P�hWv�b��l�c+m�$����
�e
���{�o�b�DB_���
�7uI�{��
Q�.0��\���|����ly%��%����������M������ �'�Q��m>����/4�2\�Tt���%��D�j�_]]�C`��]j��*'���#_��������7�wU-5L������$�����x^�T
�"�������L�upc�6mS���(��XSL��Y/�����+�u]W�
a���[2x��������)�Q��z�c�*~BYq?T���C��gP�a��^�L�x���}m-��1�WW��M@c}����f���>�3����@a6��^���2
��s{gt3/������uI\�j.
,���
 6�e��?��w@�-����NZqg�fR�����d��s=��RW�R�Z+��>VZE������*4���!U��@���������R*���;|�-�P���xZ�xV����������>b�Z�����)�t*��K���!���^���a�4}^�/'�4�������HQZ7��� ��)	)9������\hU>Q�'����)
		p�[��ILz����<��F���Y�YQO=��FC&"��(a"bV�������N ���2�E���"���G?5!�^��xG/����:�YS6)������d\Z���r�
�����"���E���d�s��E ZeE8�<gjn)������)���
y]:�6�Kt�����E)C�Yw����.���y���S��R�Z�\�(���(�W�Th<����B��O�\��bp�(��^+<����an�%�]��N��[�������(�j�fD��j*�����O��
��,h����>>�a� �e�Q����>��knr�vbL��0dh��+��s~{m$�w���u�{pz�w3.�������zov~��������������n�"�9
���������a��r�����t_��VKE���G���5�Ij)�H��BE�b�_�7���L'A���4N�m�ON\^���NNwNw�C�oXs)���T�LZ�"5�be<��qm��n����5����;V-F���2��>��-E�q�������A�)^��&O�-��h-`*�k.%"����
|��AH����i����[$�����f�W����oO�{��@`we���/�Z��DCe�Em*�
u���hV nH�����ho��J~[�|���h��"5��������ch��+��} Y�p���$����������O�����(=��O{��>8dR���3i�s���# �K����%
�pK��8����}�3�������?�^�my�����8��i���X�W_t��(�h
>���������9�(������}�z.�6�j��P�������q�xDM��Q�������` �lDv�,�5�c�N5��;R���IXF����U
�x����ew(�p]uf�>�nF{�*d�P��J����y�#V�e�gUq����OT�e�
��>��N�Z�u� �A�ba��l+�j������"u���=����2�J���I$�$��V����p��EB��R2���m&��������D����Y"�Y{V�z��t�}&���}b#�g�N�/��'Q:������1��Tp�gS�!{����:�M���Z��\`I���eR&�7(�Rd-��:�����_��E��1�yW�%(4xv_�~~k�<>�:�7"�V�P��!���?��z��-D.s��1����H����y�L�j.�B�PU�e<���-dk��`oN��\�*E���<9����E����J�`��^���!GA�o���:�+�����O��0*m��B��CJ |}[�nf�CW��4Vjz@�[(����h�H'�f�xi����������0�XK���,�	�TOi���{���Z.�4��P;n�F������R�B
��P�m��UCL����m����u����&��,�@/������e��6d�&b$��e����h�B���v%����$�A��'>��f��������oY��$����ixq�������:��V��G'<���UIs!�)f	�::���	){�Tgx&�a�J���W���n�i(b_��heM����)%����]�]RM�������~���J���j����n���bi�&��)-�e���7
�{Z
�N�b��n�dX�2�\�!�M*C�?p�J��J�\������^pc6�MK���e��+��4o��p8�����}�����x���/�>]�����_���M���0�YN�y
n��#��������u~a�����q,?�eX�q�T�s��h��Mx�D��a�?�ou>t{2��j��|�[!tv+�x�B= �*�����6��S*������grJ����^����H�S��WE�`���N�w9���o\O�}6_dJ�b��v>S���ff��Y�z:q��c�������&�^=��%8h��Z+{�%t������c2���s��y�.g�$��N����7����Q�]O��C��m.��b�pr���C'F;z�{����kG�����7������T��H
��i�9#	c�8�!)��VsR/3��%�(��"{�Y=	K6�/0����~<�<�$�R`U�^
�@R����{TnV;w����0�Q���$,ZI{�pD�ZJ%�H��SCc�KA������63�������yZ�P������,�K��3��K��������R)5sp���P;z!{�fn=Ln!f�����l:s���5{q���"��G�}n���4�`�����������
��{��DxK����e����Lg��.��SP�+�����A�RG��'�'T+�|�!^���`�!�B�����.C��N�"�����et��n*������Z�M�%��va�!o��M��`xi��%��]�Xj�J#��
�����7	���{}1�V�w����qY���|��w����1�)���>Op���v����b�+�^;��B/m<_�\."M[�.z=dh
������
��O]4�r^�4~����Kqq�c�Oc�dM{g�h���������Z�L[w�_�NH�&�r���+�������D�r�����r�����������gn{(�Y
�P����#9��Z��N3|Rj{�7Y�����x�<W���dey&;_w�N���jp�n-d)�J��#Z��x�X�FV�������2����H���R��k�3�H���1-0��vbtc��k��	��K��gY�EESj����U `9j���������69�H�Hf�l7hg���l�h	v��f	�%d���t�+�Y���^���r�������XM(H���2sf��	��\y#�D���	��"�>�	�%�+�����	�N��R��
\i�`p�I���R���1���/�����N�L�����x� X�o~[;�w����o�,�����%��s�5Hg�����]x�P��~T��aL&���Y�%y&�o4�p��f��&Vn�J��5c-y,�sO����)C+dE��jpI�"�^oL9_�R)�!]�A&�H--*��bI1�nr��ed������&}tT����$�R��D?(���,H�P�O��py{
����GWO�h����D�b���}@�������-��dl��wIX�G*^cR~p�cM>������|��#��)5�H91�,XY�A���XZ���S�:zK����x�Z+nr������d��cy���!*��(u���M�w����2�"��]<�,ui�o�aPw�Qw����my������[�����^�=�j���N}���������[�\����7��g�����XP�yL1�1W���\��j��������.��?�K�m�A��a�JZ-�����H�~25���;��>za{������zW����"r-�}^�c��Nv�m��W,����]�T�����w����":�w�;�e�x��W���lj���Up�%5�R�\��i��d2NU�t�-� a�������S��D��S�9�:�v�K	:��
�y��xhv�}��P�_@����:d���C�.��u�w����q3N�Ht�9U7N�|�S���#���M�|�OxE�m����f�2���($�E�Z��h�!nTk��B��;�������C��5�>��5=�����~'�|Z8�4��(���O�����`����=�Kp-!���<������L��{��T���r��s����+)����y
��cq��D�����}}�1T�v����j��B��V5��e��"BE��������#�3JDNeX���^����KG����	w�J[U<g�K8��Yu3r�g�Uqd�eml�j�+���qU=U��������k���Wvj����2[����$I��f#&/,�8#���C����Rm�Og�y5�sk��8���dh�A
6�@Ai��lH�qQ�����$���	��S�5�8O!�1>y	<�Jw�Gh"�(�?�����.{�o�B��'����O��B0gzX���&���f��!\D�$��GZ6��RH^�)n��R���m��%zQ�+��ex'6�Y��i=t��Jw�d���c	X��}�WP��������A(zT'��=wp����b�FW4��Zh]�y��FWm>��e�t		��Ir1�LlWq�'$:�_ex�7J�V����T��;I�Nii�}(+�������k�*.?��n�X	�^�u6�,�Rj/\w^�f�.��.X�zq��*�w	�v�h�����0�3���l�����uM��S�9X22-;w�����t-s��Y� P���h���*,�	7Fv�P+���R����	�&D�NgqJ�D���_n���M�!r���������'���9O?��y�s
��+���\*��Z} a������ed0�������>C?��V�w>��o1�-
�d��rI��#����4k���a��P����6]*����-�������b��������h���[��:�AVF�������:'<�>7f�_>�t:��W�-���������s��p��fn�9����9��l�qf�J9n�1��(���y5^<X[[+��)����/|����7�>-��V��z�>z�{�������Ao���I�Aa�DQ�^����r*
��{�wOM������o��fP��l�8��.���p���{h����>"����f<����~�[���z��m�c��t��������������c7�	E��:�k�t�x����A���OwwzG�v^��p6���Gh���	�
�������y�?71�T�2��p��@��++��yE����Jf�\B��6
��^i�w�?��
����=�#3�Eyd'���G�����.Gw���z�-\���.��bAm����q��m��\s,�o��}s��^��]A{0���\A
��wrz�w������CHy��-~����.������>���B�c�#pg��
��4t�y�iL�j��C&��S������A-�>4=��{���t����cf�Nq�A�Tj�	�>c�6(3a���9�������a~���;�&A1�Db4}�	7\��6��{r��S�;S0E����������s�?)�K�m��
�`��I*��b�o-��mR�����;�=uo�0�Oin&����%4p��'o���6�u@M����������}�;���.]���������np=U��q5�����\�tG��qH1gX���#���r-����/�R������KZ\��dN��'���������0�`H0/�V�&�����Ws�zV�7�q�B���r����o����6�f���N��/�\�H��~�(�
���+��f+^����j�-n��2p������*�� �^"�[��(���I�&�~$.�/fQ*C�r��d�l�o�0-���WJ��Qz+�p���<���N���?��KY������AT�����We^�+<��+}F�0�����iG�������2M�0_��z����s����L/~�'a���^��=���/j������B��GDAXu�e�6G���H\��-T#���
_Eq�Q��kN��dub=���<���]"�0��A��XQA�/�j�8��S]�����
� (nIKbS��/�������l�� ='��z�z��E o����"�nI �FU����uFy8A�S�
b�W���i��W�5Jdy"Ku��P�XQ2�Vzu�9������d�4�)�_�S���ak}=��qQ���+�$a}���c$bn80iy�m�:�����Omj�������?\�� ���2Y��.�MEr�-I��]OT�=>����:O���#��PY��b�&�M�����m�!,��	�����U�9@1�W2j�Or
_}���[b�e��,�w_v�l~)<����9����um5�7���:nt3�D
�N���:p���z�����C1��*]���Y]Yh���"�E�33�-tQ}Y������ �����f��3<�����E�r���_?�>���aQ��(7X�����s��Ak ~^rI�lB%h^�����m�����=PG��[�A3���b���ez�>�������$�pT��$~�������ILu�������p �N�5��P���x�P�c�~������o��*��|.�B��A�8%s9�X.��+���PI~�V������
��)�{����h_����Lo��c'�7c�2�oZ/I� �����x�W��*�`>�h������znx��5obxb7'�W�qz�rJ��:�������_|b�2=;E~��Z~������%�@y��
��*�XxLx>e�� �zsC^}ycn�.�������2H����y91�|�����r��FR�oH��A�7Z>X7P��s��������|v���m���R����B�ONGI]0-�W��;0�i5�5By][*Y�4bF���t^/3��'/�tf?i�p����LNzP�F�>�(����O��Izj�d�RU��G����N��y
y����N�Wf��q�$3&j����x$���V#m���7�+	�r%ym8��2�@pmh�7�����u����Nw����1r+lE'r ����"l0pP]���I����.����5������P���+`\T��<��J���2���A���mH0~j��3��qDf���x�&��������'m+��|�$�K�V���:�OD#_8[y7D5q�� �����G&9��:���/2�Z���J����E����meD�r��*nWm
�pM�����h�e�Y��j:�:r�a��X��4%"���7��|J�)ty6>F�{�P�ELC}s��4�x����\��mcE:>
�����uVk�~������S��N�K���}�`��R���qr�V�g6�[�P���I������IQ~��w����bT`�d�
�T=���
|e��y�����7���=!��mi�)��D�cg�-���hl���=��zO�I���S���OY���UsF��$���#j76�F4,��S�Jp���� ��<�=~���Jy��d?�_�S�f����'����G����y��S"/UW�3�x�q8�e<$��V�L9���_�w������\�������A��]�9Sn���'���:i���O��~����>=/�g������_o~�������p�������V�x��C<�H�@��E]8��������q������q�������t�}f�������o��6���hX.���a�c��/�V�ZA!1A�9����x�������M�D!J9��c[��i�����,�K�V�"_(�^�����d�|��g����Gu��������)W��^���@h����s]��h��x��p~�5BQ9v��������)�P�S��3�����~����^�$	O����������o��~���r�0�	��b�}��BP�<�����k��6����6%&R���H��z����\�.�Y+-���\~R�kd��,]5�G��p��|��A)Aa��7���^���jd���$C�s�K{�v�k���q6������m�]�����}��~���^�m0�n�.{�flN�������a5"2�L����/f����Y|s"��_N\&ii{���k�����s�W���nS��v=g���^+hZ�&`�d+���=�r��N������4(#O�6��[�\�u+A�:��Z�H����N��Z��p���)X�
#54 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#53)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

A patch with my attempt to detect ordered datasets is attached. The implemented
idea is described below.
Index tuples are divided into chunks of 128. For each chunk we measure how many
leaf pages where index tuples were inserted don't match those of the previous
chunk. Based on statistics of several chunks we estimate the distribution of
accesses between leaf pages (an exponential distribution law is assumed, and
that seems to be an error). After that we can estimate the portion of index
tuples which can be processed without actual IO. If this estimate exceeds a
threshold then we should switch to buffering build.
Now my implementation successfully detects randomly mixed datasets and well
ordered datasets, but it seems to be too optimistic about intermediate
cases. I believe it's due to a wrong assumption about the distribution law.
Do you think this approach is acceptable? Probably there is some research
on the distribution law for such cases (though I didn't find anything relevant
in Google Scholar)?
As an alternative, I can propose taking into account the actual average IO
operations per tuple rather than an estimate.
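
To make the per-chunk measurement concrete, here is a minimal sketch in plain C. It is not code from the attached patch: the names ChunkStats, chunk_stats_add and BlockNo are hypothetical, the linear set lookup is chosen only for brevity, and the later step that fits a distribution law to the counts is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 128          /* tuples per chunk, as in the description */

typedef uint32_t BlockNo;       /* stand-in for PostgreSQL's BlockNumber */

/* Hypothetical bookkeeping; not a structure from the patch. */
typedef struct ChunkStats
{
    BlockNo prev[CHUNK_SIZE];   /* distinct leaf blocks of the previous chunk */
    size_t  nprev;
    BlockNo cur[CHUNK_SIZE];    /* distinct leaf blocks of the current chunk */
    size_t  ncur;
    size_t  ntuples;            /* tuples seen so far in the current chunk */
} ChunkStats;

static bool
block_in_set(const BlockNo *set, size_t n, BlockNo blk)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == blk)
            return true;
    return false;
}

/*
 * Record the leaf block that received one index tuple.  Returns true when
 * this tuple completes a chunk; *mismatches is then set to the number of
 * distinct leaf blocks of that chunk that the previous chunk did not touch.
 */
static bool
chunk_stats_add(ChunkStats *cs, BlockNo leaf, size_t *mismatches)
{
    if (!block_in_set(cs->cur, cs->ncur, leaf))
    {
        assert(cs->ncur < CHUNK_SIZE);
        cs->cur[cs->ncur++] = leaf;
    }

    if (++cs->ntuples < CHUNK_SIZE)
        return false;

    *mismatches = 0;
    for (size_t i = 0; i < cs->ncur; i++)
        if (!block_in_set(cs->prev, cs->nprev, cs->cur[i]))
            (*mismatches)++;

    /* the finished chunk becomes the reference for the next one */
    memcpy(cs->prev, cs->cur, cs->ncur * sizeof(BlockNo));
    cs->nprev = cs->ncur;
    cs->ncur = 0;
    cs->ntuples = 0;
    return true;
}
```

A caller would zero-initialize a ChunkStats and pass it the leaf block number of every insert. A well ordered dataset should then report few mismatches per chunk, and a randomly mixed one should report many.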

------
With best regards,
Alexander Korotkov.

On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hi!

A new version of the patch is attached. It has the following changes.
1) Since the proposed technique is not always a "fast" build, it was renamed
everywhere in the patch to "buffering" build.
2) Parameter "buffering" now has 3 possible values: "yes", "no" and "auto".
"auto" means automatic switching from regular index build to buffering one.
Currently it just switches when the index size exceeds maintenance_work_mem.
3) Holding of many buffers pinned is avoided.
4) Rebased with head.

TODO:
1) Take care of ordered datasets in automatic switching.
2) Take care of concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.8.0.patch.gz (application/x-gzip)
#55 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#54)
Re: WIP: Fast GiST index build

On 22.07.2011 12:38, Alexander Korotkov wrote:

A patch with my attempt to detect ordered datasets is attached. The implemented
idea is described below.
Index tuples are divided into chunks of 128. For each chunk we measure how many
leaf pages where index tuples were inserted don't match those of the previous
chunk. Based on statistics of several chunks we estimate the distribution of
accesses between leaf pages (an exponential distribution law is assumed, and
that seems to be an error). After that we can estimate the portion of index
tuples which can be processed without actual IO. If this estimate exceeds a
threshold then we should switch to buffering build.
Now my implementation successfully detects randomly mixed datasets and well
ordered datasets, but it seems to be too optimistic about intermediate
cases. I believe it's due to a wrong assumption about the distribution law.
Do you think this approach is acceptable? Probably there is some research
on the distribution law for such cases (though I didn't find anything relevant
in Google Scholar)?

Great! It would be nice to find a more scientific approach to this, but
that's probably fine for now. It's time to start cleaning up the patch
for eventual commit.

You got rid of the extra page pins, which is good, but I wonder why you
still pre-create all the GISTLoadedPartItem structs for the whole
subtree in loadTreePart()? Can't you create those structs on-the-fly,
when you descend the tree? I understand that it's difficult to update
all the parent-pointers as trees are split, but it feels like there's
way too much bookkeeping going on. Surely it's possible to simplify it
somehow.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#56 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#55)
1 attachment(s)
Re: WIP: Fast GiST index build

On 25.07.2011 21:52, Heikki Linnakangas wrote:

On 22.07.2011 12:38, Alexander Korotkov wrote:

A patch with my attempt to detect ordered datasets is attached. The implemented
idea is described below.
Index tuples are divided into chunks of 128. For each chunk we measure how many
leaf pages where index tuples were inserted don't match those of the previous
chunk. Based on statistics of several chunks we estimate the distribution of
accesses between leaf pages (an exponential distribution law is assumed, and
that seems to be an error). After that we can estimate the portion of index
tuples which can be processed without actual IO. If this estimate exceeds a
threshold then we should switch to buffering build.
Now my implementation successfully detects randomly mixed datasets and well
ordered datasets, but it seems to be too optimistic about intermediate
cases. I believe it's due to a wrong assumption about the distribution law.
Do you think this approach is acceptable? Probably there is some research
on the distribution law for such cases (though I didn't find anything relevant
in Google Scholar)?

Great! It would be nice to find a more scientific approach to this, but
that's probably fine for now. It's time to start cleaning up the patch
for eventual commit.

You got rid of the extra page pins, which is good, but I wonder why you
still pre-create all the GISTLoadedPartItem structs for the whole
subtree in loadTreePart()? Can't you create those structs on-the-fly,
when you descend the tree? I understand that it's difficult to update
all the parent-pointers as trees are split, but it feels like there's
way too much bookkeeping going on. Surely it's possible to simplify it
somehow.

That was quite an off-the-cuff remark, so I took the patch and culled out the
loaded-tree bookkeeping. A lot of other changes fell out of that, so
it took me quite some time to get it working again, but here it is. This
is a *lot* smaller patch, although that's partly explained by the fact
that I left out some features: prefetching and the neighbor relocation
code are gone.

I'm pretty exhausted by this, so I just wanted to send this out without
further analysis. Let me know if you have questions on the approach
taken. I'm also not sure how this performs compared to your latest
patch; I haven't done any performance testing. Feel free to use this as
is, or as a source of inspiration :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-fast-build-v0.8-lobotomized-1.patch (text/x-diff)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 4657425..d829243 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -30,6 +30,9 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+
+static void validateBufferingOption(char *value);
+
 /*
  * Contents of pg_class.reloptions
  *
@@ -66,6 +69,14 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"neighborrelocation",
+			"Enables relocation of index tuples into neighbor node buffers in GiST index buffering build",
+			RELOPT_KIND_GIST
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -159,6 +170,22 @@ static relopt_int intRelOpts[] =
 			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
 		}, -1, 0, 2000000000
 	},
+	{
+		{
+			"levelstep",
+			"Level step in GiST index buffering build",
+			RELOPT_KIND_GIST
+		},
+		-1, 1, 100
+	},
+	{
+		{
+			"buffersize",
+			"Buffer size in GiST index buffering build",
+			RELOPT_KIND_GIST
+		},
+		-1, 1, 1000000000
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -219,6 +246,17 @@ static relopt_real realRelOpts[] =
 
 static relopt_string stringRelOpts[] =
 {
+	{
+		{
+			"buffering",
+			"Enables buffering build for this GiST index",
+			RELOPT_KIND_GIST
+		},
+		4,
+		false,
+		validateBufferingOption,			
+		"auto"
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1282,3 +1320,22 @@ tablespace_reloptions(Datum reloptions, bool validate)
 
 	return (bytea *) tsopts;
 }
+
+/*
+ * Validator for "buffering" option of GiST indexes. Allows only "on", "off" and
+ * "auto" values.
+ */
+static void
+validateBufferingOption(char *value)
+{
+	if (!value ||
+		(
+			strcmp(value, "on") &&
+			strcmp(value, "off") &&
+			strcmp(value, "auto")
+		)
+	)
+	{
+		elog(ERROR, "Only \"on\", \"off\" and \"auto\" values are available for \"buffering\" option.");
+	}
+}
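
The validator in the hunk above accepts a value only when one of the three strcmp() calls returns zero. As a hedged, standalone illustration of the same check outside PostgreSQL, the sketch below returns a boolean instead of calling elog(). The names is_valid_buffering_value and check_buffering_examples are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true only for the three spellings the patch accepts.  The
 * comparison is case-sensitive, as it is with strcmp() in the patch.
 */
static bool
is_valid_buffering_value(const char *value)
{
    static const char *const allowed[] = {"on", "off", "auto"};

    if (value == NULL)
        return false;

    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (strcmp(value, allowed[i]) == 0)
            return true;

    return false;
}

/* Usage: a few spot checks of the predicate. */
static void
check_buffering_examples(void)
{
    assert(is_valid_buffering_value("auto"));
    assert(!is_valid_buffering_value("yes"));
    assert(!is_valid_buffering_value(NULL));
}
```

Note that under this check the spellings "yes" and "no" are rejected; only "on", "off" and "auto" pass.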
diff --git a/src/backend/access/gist/Makefile b/src/backend/access/gist/Makefile
index f8051a2..cc9468f 100644
--- a/src/backend/access/gist/Makefile
+++ b/src/backend/access/gist/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-       gistproc.o gistsplit.o
+       gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 2d78dcb..45f8fc9 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -24,6 +24,7 @@ The current implementation of GiST supports:
   * provides NULL-safe interface to GiST core
   * Concurrency
   * Recovery support via WAL logging
+  * Buffering build algorithm
 
 The support for concurrency implemented in PostgreSQL was developed based on
 the paper "Access Methods for Next-Generation Database Systems" by
@@ -31,6 +32,12 @@ Marcel Kornaker:
 
     http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
 
+Buffering build algorithm for GiST was developed based on the paper "Efficient
+Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
+and Jeffrey Scott Vitter.
+
+    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+
 The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@@ -278,6 +285,113 @@ would complicate the insertion algorithm. So when an insertion sees a page
 with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
 crashed in the middle to completion by adding the downlink in the parent.
 
+Buffering build algorithm
+--------------------
+
+In the buffering build algorithm, levels are numbered upwards and the leaf page
+level has number zero. An example is given in the picture below. This numbering
+is used so that a page keeps its level number for its whole lifetime, even if
+the root splits.
+
+Level                    Tree
+
+3                         *
+                      /       \
+2                *                 *
+              /  |  \           /  |  \
+1          *     *     *     *     *     *
+          / \   / \   / \   / \   / \   / \
+0        o   o o   o o   o o   o o   o o   o
+
+* - internal page
+o - leaf page
+
+In the buffering build algorithm, additional buffers are associated with pages.
+Leaf pages never have buffers. Internal pages have buffers at a fixed level
+step, i.e. pages on levels level_step*i (i >= 1) have buffers. If level_step = 1
+then each internal page has a buffer.
+
+Level        Tree (level_step = 1)                Tree (level_step = 2)   
+                                        
+3                      *(b)                                  *
+                   /       \                             /       \
+2             *(b)              *(b)                *(b)              *(b)
+           /  |  \           /  |  \             /  |  \           /  |  \
+1       *(b)  *(b)  *(b)  *(b)  *(b)  *(b)    *     *     *     *     *     *
+       / \   / \   / \   / \   / \   / \     / \   / \   / \   / \   / \   / \
+0     o   o o   o o   o o   o o   o o   o   o   o o   o o   o o   o o   o o   o
+
+(b) - buffer
+
+Each buffer can be in one of the following states:
+1) Append. New tuples are being added to the buffer. The last page of the buffer is kept in main memory.
+2) Flush. Tuples are being read from the buffer. The page containing the last returned tuple is kept in main memory.
+
+A buffer goes from append mode to flush mode on the first call to popItupFromNodeBuffer(), and from flush mode back to append mode when the buffer is empty.
+
+When an index tuple is inserted, its first path can end at one of these points:
+1) Leaf page if no levels have buffers, i.e. root level <= level_step.
+2) Buffer of root page if root page has a buffer.
+3) Buffer of topmost level page if root page doesn't have a buffer.
+
+New index tuples are processed until the root level buffer (or the buffer of a
+topmost level page) is half full. When some buffer becomes half full, the
+process of emptying it starts.
+
+The buffer emptying process moves index tuples from a buffer into the
+underlying buffers (if any) or leaf pages. To empty a buffer into other buffers,
+the following items should be loaded into main memory:
+1) Buffer itself should be completely loaded
+2) Underlying buffers should be loaded for append
+3) Page associated with buffer
+4) Pages between buffer and underlying buffers if level_step != 1 (note that
+   pages associated with underlying buffers aren't required to be loaded)
+For emptying to leaf pages, the list of items is as follows:
+1) Buffer itself should be completely loaded
+2) Page associated with buffer
+3) Pages between buffer and leaf pages if level_step != 1
+4) Leaf pages
+An illustration of these requirements is given below.
+
+   Buffer emptying to other buffers      Buffer emptying to leaf pages
+
+                 +(cb)                                 +(cb)                  
+              /     \                                /     \                    
+          +             +                        +             +            
+        /   \         /   \                    /   \         /   \              
+      *(ab)   *(ab) *(ab)   *(ab)            x       x     x       x  
+
++    - internal page loaded into main memory
+x    - leaf page loaded into main memory
+*    - internal page not loaded into main memory
+(cb) - completely loaded buffer
+(ab) - buffer loaded for append
+
+One buffer emptying process can trigger other buffer emptying processes.
+The buffer emptying stack is the data structure responsible for the order
+of buffer emptying. Each node buffer which is half filled should be
+inserted into the buffer emptying stack.
+
+When we move from buffer emptying on a higher level to buffer emptying on a
+lower level, the loaded part of the tree (only tree pages, not the buffers)
+remains in main memory. The tree parts stack is the data structure which
+represents the hierarchy of loaded tree parts.
+
+If a split occurs on a page which has a buffer, then index tuples are
+relocated. When neighborrelocation is off, index tuples are relocated between
+the buffers of the pages produced by the split using the penalty method. This
+method was proposed in the original paper. To improve index quality,
+another method of relocation was implemented. When neighborrelocation is on,
+index tuples are relocated into both the buffers of the pages produced by the
+split and the buffers of neighbor pages (pages with the same parent). This
+method uses more CPU and IO but improves index quality.
+
+When all index tuples are inserted, there are still some index tuples in
+buffers. At this point the final buffer emptying starts. Each level has a list
+of non-empty buffers. Final emptying loops over all tree levels, starting from
+the topmost. On each level, all its buffers are emptied sequentially until all
+buffers of the level are empty. Since no index tuples move upwards during buffer
+emptying, all the buffers are empty when final emptying is finished.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
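
The README text added above says that leaf pages (level 0) never have buffers and that internal pages on levels level_step*i (i >= 1) do. A minimal sketch of that rule in plain C follows. The function name level_has_buffers is hypothetical, and the patch's own code may express the rule differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Levels are numbered upwards from the leaves, which are level 0.
 * A level carries node buffers when it is a positive multiple of
 * level_step.
 */
static bool
level_has_buffers(int level, int level_step)
{
    assert(level >= 0 && level_step >= 1);
    return level > 0 && level % level_step == 0;
}
```

With level_step = 1 this marks levels 1, 2 and 3 of the four-level example tree as buffered. With level_step = 2 it marks only level 2, which matches the two pictures in the README text.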
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4fc7a21..278de32 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -24,38 +24,16 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-/* Working state for gistbuild and its callback */
-typedef struct
-{
-	GISTSTATE	giststate;
-	int			numindexattrs;
-	double		indtuples;
-	MemoryContext tmpCtx;
-} GISTBuildState;
-
 /* A List of these is used represent a split-in-progress. */
 typedef struct
 {
 	Buffer		buf;			/* the split page "half" */
 	IndexTuple	downlink;		/* downlink for this half. */
+    bool        release;
 } GISTPageSplitInfo;
 
 /* non-export function prototypes */
-static void gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state);
-static void gistdoinsert(Relation r,
-			 IndexTuple itup,
-			 Size freespace,
-			 GISTSTATE *GISTstate);
 static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
-static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
-				 GISTSTATE *giststate,
-				 IndexTuple *tuples, int ntup, OffsetNumber oldoffnum,
-				 Buffer leftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 				GISTSTATE *giststate, List *splitinfo);
 
@@ -89,138 +67,6 @@ createTempGistContext(void)
 }
 
 /*
- * Routine to build an index.  Basically calls insert over and over.
- *
- * XXX: it would be nice to implement some sort of bulk-loading
- * algorithm, but it is not clear how to do that.
- */
-Datum
-gistbuild(PG_FUNCTION_ARGS)
-{
-	Relation	heap = (Relation) PG_GETARG_POINTER(0);
-	Relation	index = (Relation) PG_GETARG_POINTER(1);
-	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
-	IndexBuildResult *result;
-	double		reltuples;
-	GISTBuildState buildstate;
-	Buffer		buffer;
-	Page		page;
-
-	/*
-	 * We expect to be called exactly once for any index relation. If that's
-	 * not the case, big trouble's what we have.
-	 */
-	if (RelationGetNumberOfBlocks(index) != 0)
-		elog(ERROR, "index \"%s\" already contains data",
-			 RelationGetRelationName(index));
-
-	/* no locking is needed */
-	initGISTstate(&buildstate.giststate, index);
-
-	/* initialize the root page */
-	buffer = gistNewBuffer(index);
-	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
-	page = BufferGetPage(buffer);
-
-	START_CRIT_SECTION();
-
-	GISTInitBuffer(buffer, F_LEAF);
-
-	MarkBufferDirty(buffer);
-
-	if (RelationNeedsWAL(index))
-	{
-		XLogRecPtr	recptr;
-		XLogRecData rdata;
-
-		rdata.data = (char *) &(index->rd_node);
-		rdata.len = sizeof(RelFileNode);
-		rdata.buffer = InvalidBuffer;
-		rdata.next = NULL;
-
-		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
-		PageSetLSN(page, recptr);
-		PageSetTLI(page, ThisTimeLineID);
-	}
-	else
-		PageSetLSN(page, GetXLogRecPtrForTemp());
-
-	UnlockReleaseBuffer(buffer);
-
-	END_CRIT_SECTION();
-
-	/* build the index */
-	buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
-	buildstate.indtuples = 0;
-
-	/*
-	 * create a temporary memory context that is reset once for each tuple
-	 * inserted into the index
-	 */
-	buildstate.tmpCtx = createTempGistContext();
-
-	/* do the heap scan */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
-								   gistbuildCallback, (void *) &buildstate);
-
-	/* okay, all heap tuples are indexed */
-	MemoryContextDelete(buildstate.tmpCtx);
-
-	freeGISTstate(&buildstate.giststate);
-
-	/*
-	 * Return statistics
-	 */
-	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
-
-	result->heap_tuples = reltuples;
-	result->index_tuples = buildstate.indtuples;
-
-	PG_RETURN_POINTER(result);
-}
-
-/*
- * Per-tuple callback from IndexBuildHeapScan
- */
-static void
-gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state)
-{
-	GISTBuildState *buildstate = (GISTBuildState *) state;
-	IndexTuple	itup;
-	MemoryContext oldCtx;
-
-	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
-
-	/* form an index tuple and point it at the heap tuple */
-	itup = gistFormTuple(&buildstate->giststate, index,
-						 values, isnull, true /* size is currently bogus */ );
-	itup->t_tid = htup->t_self;
-
-	/*
-	 * Since we already have the index relation locked, we call gistdoinsert
-	 * directly.  Normal access method calls dispatch through gistinsert,
-	 * which locks the relation for write.	This is the right thing to do if
-	 * you're inserting single tups, but not when you're initializing the
-	 * whole index at once.
-	 *
-	 * In this path we respect the fillfactor setting, whereas insertions
-	 * after initial build do not.
-	 */
-	gistdoinsert(index, itup,
-			  RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
-				 &buildstate->giststate);
-
-	buildstate->indtuples += 1;
-	MemoryContextSwitchTo(oldCtx);
-	MemoryContextReset(buildstate->tmpCtx);
-}
-
-/*
  *	gistbuildempty() -- build an empty gist index in the initialization fork
  */
 Datum
@@ -275,7 +121,6 @@ gistinsert(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(false);
 }
 
-
 /*
  * Place tuples from 'itup' to 'buffer'. If 'oldoffnum' is valid, the tuple
  * at that offset is atomically removed along with inserting the new tuples.
@@ -608,7 +453,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  * this routine assumes it is invoked in a short-lived memory context,
  * so it does not bother releasing palloc'd allocations.
  */
-static void
+BlockNumber
 gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 {
 	ItemId		iid;
@@ -617,6 +462,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 	GISTInsertStack *stack;
 	GISTInsertState state;
 	bool		xlocked = false;
+	BlockNumber	leafBlocknum = InvalidBlockNumber;
 
 	memset(&state, 0, sizeof(GISTInsertState));
 	state.freespace = freespace;
@@ -841,7 +687,8 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			}
 
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
-
+			
+			leafBlocknum = stack->blkno;
 			gistinserttuples(&state, stack, giststate, &itup, 1,
 							 InvalidOffsetNumber, InvalidBuffer);
 			LockBuffer(stack->buffer, GIST_UNLOCK);
@@ -852,6 +699,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			break;
 		}
 	}
+	return leafBlocknum;
 }
 
 /*
@@ -1183,7 +1031,7 @@ gistfixsplit(GISTInsertState *state, GISTSTATE *giststate)
  * to hold an exclusive lock on state->stack->buffer, but if we had to split
  * the page, it might not contain the tuple we just inserted/updated.
  */
-static bool
+bool
 gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 GISTSTATE *giststate,
 				 IndexTuple *tuples, int ntup, OffsetNumber oldoffnum,
@@ -1414,6 +1262,7 @@ initGISTstate(GISTSTATE *giststate, Relation index)
 		else
 			giststate->supportCollation[i] = DEFAULT_COLLATION_OID;
 	}
+	giststate->gfbb = NULL;
 }
 
 void
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
new file mode 100644
index 0000000..8543033
--- /dev/null
+++ b/src/backend/access/gist/gistbuild.c
@@ -0,0 +1,1175 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuild.c
+ *	  Implementation of the build algorithm for GiST indexes.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuild.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+#define LEAF_PAGES_STATS_STEP 128
+#define LEAF_PAGES_STATS_COUNT 16
+
+
+/* Working state for gistbuild and its callback */
+typedef struct
+{
+	GISTSTATE		giststate;
+	int				numindexattrs;
+	int64			indtuples;
+	/*
+	 * Buffering build mode. Possible values:
+	 * 'y' - we are in buffering build mode.
+	 * 'a' - we are now in regular build mode, but can switch to buffering 
+	 *       build mode when we decide to.
+	 * 'n' - we are in regular build mode and aren't going to switch.
+	 */
+	char			bufferingMode;
+	MemoryContext	tmpCtx;
+	/* Tracking statistics about last accessed leaf pages */
+	HTAB		   *leafPagesTab;
+	int				nonHitLeafPagesStats[LEAF_PAGES_STATS_COUNT];
+	int				nonHitLeafPagesStatsIndex;
+} GISTBuildState;
+
+typedef struct
+{
+	BlockNumber		blockNumber;
+	int64			lastTupleNumber;
+} LeafPageInfo;
+
+
+static void gistBufferingBuildInsert(Relation index, IndexTuple itup, 
+						 GISTBuildState *buildstate);
+static void gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state);
+
+static bool initBuffering(GISTBuildState *buildstate, Relation index);
+static bool bufferingbuildinsert(GISTInsertState *state, 
+				GISTLoadedPartItem *item,
+				GISTSTATE *giststate,
+				IndexTuple *itup, int ntup, 
+				OffsetNumber oldoffnum);
+static int bufferlevel_cmp(const void *a, const void *b);
+
+static void gistFindCorrectParent(GISTSTATE *giststate, Relation r, GISTLoadedPartItem *child);
+
+#ifdef GIST_DEBUG
+#include "utils/geo_decls.h"
+static void gist_dumptree(Relation r, int level, BlockNumber blk, OffsetNumber coff, BOX *downlink, StringInfo out, int maxlevel);
+static int gist_counttups(Relation r, BlockNumber blk);
+static int gist_count_tuples_in_buffers(GISTBuildBuffers *gfbb);
+#endif
+
+/*
+ * Index tuple insert function of the buffering build algorithm. It is simpler
+ * than the regular insert function in that it doesn't have to care about
+ * concurrency. It invokes the buffer relocation function when it splits a
+ * page. It takes a vector of tuples to insert, because a split can replace
+ * one parent downlink (oldoffnum) with several new ones.
+ */
+static bool
+bufferingbuildinsert(GISTInsertState *state, 
+					 GISTLoadedPartItem *path,
+					 GISTSTATE *giststate,
+					 IndexTuple *itup, int ntup, 
+					 OffsetNumber oldoffnum)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	Buffer		buffer = ReadBuffer(state->r, path->blkno);
+	Page		page = (Page) BufferGetPage(buffer);
+	bool		is_leaf = (GistPageIsLeaf(page)) ? true : false;
+	int			i;
+	bool		is_split;
+
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+
+	/* Remove old index tuple if needed */
+	if (OffsetNumberIsValid(oldoffnum))
+		PageIndexMultiDelete(page, &oldoffnum, 1);
+	
+	/* Check if there is enough space for insertion */
+	is_split = gistnospace(page, itup, ntup, 
+						   InvalidOffsetNumber, state->freespace);
+
+	if (is_split)
+	{
+		/* no space for insertion */
+		IndexTuple *itvec;
+		int			tlen;
+		SplitedPageLayout *dist = NULL,
+				   *ptr;
+		SplitedPageLayout rootpg;
+		BlockNumber blkno = BufferGetBlockNumber(buffer);
+		BlockNumber oldrlink = InvalidBlockNumber;
+		bool		is_rootsplit;
+		XLogRecPtr	recptr;
+
+		is_rootsplit = (blkno == GIST_ROOT_BLKNO);
+		
+		/*
+		 * Form the vector of index tuples to split. The old tuple, if any,
+		 * was already removed from the page above.
+		 */
+		itvec = gistextractpage(page, &tlen);
+		itvec = gistjoinvector(itvec, &tlen, itup, ntup);
+		dist = gistSplit(state->r, page, itvec, tlen, giststate);
+
+		/*
+		 * Set up pages to work with. Allocate new buffers for all but the
+		 * leftmost page. The original page becomes the new leftmost page, and
+		 * is just replaced with the new contents.
+		 *
+		 * For a root-split, allocate new buffers for all child pages, the
+		 * original page is overwritten with new root page containing
+		 * downlinks to the new child pages.
+		 */
+		ptr = dist;
+		if (!is_rootsplit)
+		{
+			oldrlink = GistPageGetOpaque(page)->rightlink;
+			dist->buffer = buffer;
+			dist->block.blkno = BufferGetBlockNumber(buffer);
+			dist->page = PageGetTempPageCopySpecial(BufferGetPage(buffer));
+
+			/* clean all flags except F_LEAF */
+			GistPageGetOpaque(dist->page)->flags = (is_leaf) ? F_LEAF : 0;
+
+			ptr = ptr->next;
+		}
+		for (; ptr; ptr = ptr->next)
+		{
+			/* Allocate new page */
+			ptr->buffer = gistNewBuffer(state->r);
+			GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
+			ptr->page = BufferGetPage(ptr->buffer);
+			ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
+		}		
+
+		/*
+		 * Now that we know which blocks the new pages go to, set up downlink
+		 * tuples to point to them.
+		 */
+		for (ptr = dist; ptr; ptr = ptr->next)
+		{
+			ItemPointerSetBlockNumber(&(ptr->itup->t_tid), ptr->block.blkno);
+			GistTupleSetValid(ptr->itup);
+		}
+
+		{
+			StringInfoData sb;
+			initStringInfo(&sb);
+			for (ptr = dist; ptr; ptr = ptr->next)
+				appendStringInfo(&sb, "%u ", ptr->block.blkno);
+		}
+
+		if (is_rootsplit)
+		{
+			/*
+			 * Adjust the top element in the insert stacks for the new root
+			 * page.
+			 */
+			GISTLoadedPartItem *oldroot = gfbb->rootitem;
+
+			gfbb->rootitem = (GISTLoadedPartItem *) MemoryContextAlloc(gfbb->context,
+																	   sizeof(GISTLoadedPartItem));
+			gfbb->rootitem->parent = NULL;
+			gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+			gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+			gfbb->rootitem->level = oldroot->level + 1;
+
+			oldroot->parent = gfbb->rootitem;
+			oldroot->blkno = dist->next->block.blkno;
+			oldroot->downlinkoffnum = InvalidOffsetNumber;
+		}
+
+		/* Maintain node buffers and loaded tree parts on split */
+		relocateBuildBuffersOnSplit(giststate->gfbb, giststate, state->r,
+									path, buffer, dist);
+
+		/*
+		 * If this is a root split, we construct the new root page with the
+		 * downlinks here directly, instead of making a recursive call to
+		 * insert them. Add the new root page to the list along with the child
+		 * pages.
+		 */
+		if (is_rootsplit)
+		{
+			IndexTuple *downlinks;
+			int			ndownlinks = 0;
+			int			i;
+
+			rootpg.buffer = buffer;
+			rootpg.page = PageGetTempPageCopySpecial(
+				BufferGetPage(rootpg.buffer));
+			GistPageGetOpaque(rootpg.page)->flags = 0;
+
+			/* Prepare a vector of all the downlinks */
+			for (ptr = dist; ptr; ptr = ptr->next)
+				ndownlinks++;
+			downlinks = palloc(sizeof(IndexTuple) * ndownlinks);
+			for (i = 0, ptr = dist; ptr; ptr = ptr->next)
+				downlinks[i++] = ptr->itup;
+
+			rootpg.block.blkno = GIST_ROOT_BLKNO;
+			rootpg.block.num = ndownlinks;
+			rootpg.list = gistfillitupvec(downlinks, ndownlinks,
+										  &(rootpg.lenlist));
+			rootpg.itup = NULL;
+
+			rootpg.next = dist;
+			dist = &rootpg;
+		}
+		
+		/*
+		 * Fill all pages. All the pages are new, i.e. freshly allocated empty
+		 * pages, or a temporary copy of the old page.
+		 */
+		for (ptr = dist; ptr; ptr = ptr->next)
+		{
+			char	   *data = (char *) (ptr->list);
+
+			for (i = 0; i < ptr->block.num; i++)
+			{
+				if (PageAddItem(ptr->page, 
+							    (Item) data, IndexTupleSize((IndexTuple) data), 
+								i + FirstOffsetNumber, false, false) == 
+					InvalidOffsetNumber)
+					elog(ERROR, "failed to add item to index page in \"%s\"", 
+						RelationGetRelationName(state->r));
+				data += IndexTupleSize((IndexTuple) data);
+			}
+			
+			/* Set up rightlinks */
+			if (ptr->next && ptr->block.blkno != GIST_ROOT_BLKNO)
+				GistPageGetOpaque(ptr->page)->rightlink =
+					ptr->next->block.blkno;
+			else
+				GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
+		}
+
+		/* Mark buffers dirty */
+		for (ptr = dist; ptr; ptr = ptr->next)
+			MarkBufferDirty(ptr->buffer);
+
+		PageRestoreTempPage(dist->page, BufferGetPage(dist->buffer));
+		dist->page = BufferGetPage(dist->buffer);
+
+		/* TODO: Write the WAL record */
+#ifdef NOT_IMPLEMENTED
+		if (RelationNeedsWAL(state->r))
+			recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
+								   dist, oldrlink, oldnsn, leftchildbuf);
+		else
+			recptr = GetXLogRecPtrForTemp();
+#else
+		recptr.xlogid = 0;
+		recptr.xrecoff = 1;
+#endif
+		
+		for (ptr = dist; ptr; ptr = ptr->next)
+		{
+			PageSetLSN(ptr->page, recptr);
+			PageSetTLI(ptr->page, ThisTimeLineID);
+		}
+
+		/* Release all the buffers, including the one with the new tuple */
+		for (ptr = dist; ptr; ptr = ptr->next)
+		{
+			if (BufferIsValid(ptr->buffer))
+				UnlockReleaseBuffer(ptr->buffer);
+		}
+
+		/*
+		 * If it wasn't a root split, insert the downlinks into the parent page.
+		 */
+		if (!is_rootsplit)
+		{
+			IndexTuple *itups;
+			int cnt = 0, i;
+			
+			for (ptr = dist; ptr; ptr = ptr->next)
+			{
+				cnt++;
+			}
+			itups = (IndexTuple *)palloc(sizeof(IndexTuple) * cnt);
+			i = 0;
+			for (ptr = dist; ptr; ptr = ptr->next)
+			{
+				itups[i] = ptr->itup;
+				i++;
+			}
+
+			/* Re-find the parent's downlink to this page; it may have moved */
+			gistFindCorrectParent(giststate, state->r, path);
+
+			bufferingbuildinsert(state, path->parent, giststate, itups,
+								 cnt,
+								 path->downlinkoffnum);
+		}
+	}
+	else
+	{
+		/*
+		 * There is enough space. Just insert the index tuples into the page.
+		 */
+		gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+		MarkBufferDirty(buffer);
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return is_split;
+}
+
+/*
+ * Process an index tuple. Run the index tuple down the tree until it reaches
+ * a leaf page or a node buffer.
+ */
+static void
+processItup(GISTSTATE *giststate, GISTInsertState *state, 
+			GISTBuildBuffers *gfbb, IndexTuple itup,
+			GISTLoadedPartItem *startparent)
+{
+	GISTLoadedPartItem *path;
+	BlockNumber childblkno; 
+	Buffer		buffer;
+
+	if (!startparent)
+		path = gfbb->rootitem;
+	else
+		path = startparent;
+
+	/*
+	 * Loop until we are on a leaf page (level == 0) or we reach a level with
+	 * buffers (other than the level we started at).
+	 */
+	for (;;)
+	{
+		ItemId		iid;
+		IndexTuple	idxtuple, newtup;
+		Page		page;
+		OffsetNumber childoffnum;
+		GISTLoadedPartItem *parent;
+
+		if (path != startparent && LEVEL_HAS_BUFFERS(path->level, gfbb))
+			break;
+
+		if (path->level == 0)
+			break;
+
+		/* Choose child for insertion */
+		buffer = ReadBuffer(state->r, path->blkno);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+
+		page = (Page) BufferGetPage(buffer);
+		childoffnum = gistchoose(state->r, page, itup, giststate);
+		iid = PageGetItemId(page, childoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+		/* Adjust key representing child if needed */
+		newtup = gistgetadjusted(state->r, idxtuple, itup, giststate);
+
+		UnlockReleaseBuffer(buffer);
+
+		if (newtup)
+			bufferingbuildinsert(state, path, giststate, 
+								 &newtup, 1, childoffnum);
+
+		/* descend */
+		parent = path;
+		path = (GISTLoadedPartItem *) MemoryContextAlloc(gfbb->context,
+														 sizeof(GISTLoadedPartItem));
+		path->parent = parent;
+		path->level = parent->level - 1;
+		path->blkno = childblkno;
+		path->downlinkoffnum = childoffnum;
+	}
+
+	if (LEVEL_HAS_BUFFERS(path->level, gfbb))
+	{
+		/*
+		 * We've reached a level with buffers. Place the index tuple into the
+		 * buffer, and push it onto the emptying stack if it has just overflowed.
+		 */
+		bool wasOverflowed;
+		NodeBuffer *childNodeBuffer;
+
+		childNodeBuffer = getNodeBuffer(gfbb, giststate, path->blkno, path->downlinkoffnum, path->parent, true);
+		wasOverflowed = BUFFER_IS_OVERLOW(childNodeBuffer, gfbb);
+		pushItupToNodeBuffer(gfbb, childNodeBuffer, itup);
+		if (!wasOverflowed &&  BUFFER_IS_OVERLOW(childNodeBuffer, gfbb))
+		{
+			MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+			gfbb->bufferEmptyingStack = lcons(childNodeBuffer, gfbb->bufferEmptyingStack);
+			MemoryContextSwitchTo(oldcxt);
+			path = NULL; /* don't free the allocated items */
+		}
+	}
+	else
+	{
+		/*
+		 * We've reached leaf level. So, place index tuple here.
+		 */
+		bufferingbuildinsert(state, path, giststate, &itup, 1, InvalidOffsetNumber);
+	}
+
+	if (path)
+	{
+		while(path != startparent && path != gfbb->rootitem)
+		{
+			GISTLoadedPartItem *parent = path->parent;
+			pfree(path);
+			path = parent;
+		}
+	}
+}
+
+
+static void
+gistFindCorrectParent(GISTSTATE *giststate, Relation r, GISTLoadedPartItem *child)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	GISTLoadedPartItem *parent = child->parent;
+	OffsetNumber i,
+		maxoff;
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	Buffer buffer;
+	Page page;
+	bool copied = false;
+
+	buffer = ReadBuffer(r, parent->blkno);
+	page = BufferGetPage(buffer);
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	gistcheckpage(r, buffer);
+
+	/* Check if it has not moved */
+	if (child->downlinkoffnum != InvalidOffsetNumber)
+	{
+		iid = PageGetItemId(page, child->downlinkoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+		{
+			/* Still there */
+			UnlockReleaseBuffer(buffer);
+			return;
+		}
+	}
+
+	/* The parent has changed; follow right links until the child is found */
+	while (true)
+	{
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+			{
+				/* yes!!, found */
+				child->downlinkoffnum = i;
+				UnlockReleaseBuffer(buffer);
+				return;
+			}
+		}
+
+		if (!copied)
+		{
+			parent = (GISTLoadedPartItem *) MemoryContextAlloc(gfbb->context,
+															   sizeof(GISTLoadedPartItem));
+			memcpy(parent, child->parent, sizeof(GISTLoadedPartItem));
+			copied = true;
+		}
+
+		parent->blkno = GistPageGetOpaque(page)->rightlink;
+		UnlockReleaseBuffer(buffer);
+
+		if (parent->blkno == InvalidBlockNumber)
+		{
+			/*
+			 * End of chain and still didn't find parent. Should not happen
+			 * during index build.
+			 */
+			break;
+		}
+
+		/* Next page */
+
+		buffer = ReadBuffer(r, parent->blkno);
+		page = BufferGetPage(buffer);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(r, buffer);
+	}
+
+	elog(ERROR, "failed to re-find parent for block %u", child->blkno);
+}
+
+/*
+ * Process the buffer emptying stack. Emptying one buffer can cause emptying
+ * of other buffers. This function iterates until the cascading emptying
+ * process has finished, i.e. until the buffer emptying stack is empty.
+ */
+static void
+processEmptyingStack(GISTSTATE *giststate, GISTInsertState *state)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	int requiredSizeDecrease = gfbb->pagesPerBuffer * BLCKSZ;
+	
+	/* Iterate while we have elements in buffers emptying stack. */	
+	while (gfbb->bufferEmptyingStack != NIL)
+	{
+		int initialBusySize, busySize;		
+		NodeBuffer *emptyingNodeBuffer;
+		
+		/* Remove element from the stack and prepare for emptying. */
+		emptyingNodeBuffer = (NodeBuffer *) linitial(gfbb->bufferEmptyingStack);
+		gfbb->bufferEmptyingStack = list_delete_first(gfbb->bufferEmptyingStack);
+
+		initialBusySize = getNodeBufferBusySize(emptyingNodeBuffer);
+		gfbb->currentEmptyingBufferSplit = false;
+		
+		while(true)
+		{
+			IndexTuple itup;
+
+			/* Get one index tuple from node buffer and process it */			
+			if (!popItupFromNodeBuffer(gfbb, emptyingNodeBuffer, &itup))
+				break;
+			processItup(giststate, state, gfbb, itup,
+						emptyingNodeBuffer->path);
+
+			/* Free all the memory allocated during index tuple processing */	
+			MemoryContextReset(CurrentMemoryContext);
+
+			/* 
+			 * If the node buffer being emptied was split, stop emptying it:
+			 * that node buffer doesn't exist anymore.
+			 */
+			if (gfbb->currentEmptyingBufferSplit)
+				break;
+			
+			busySize = getNodeBufferBusySize(emptyingNodeBuffer);
+			
+			/* 
+			 * If we've processed half of the buffer size limit and the buffer
+			 * is no longer overflowed, stop, in order to avoid exceeding the
+			 * limit in lower buffers.
+			 */
+			if (initialBusySize - busySize >= requiredSizeDecrease && 
+				!BUFFER_IS_OVERLOW(emptyingNodeBuffer, gfbb))
+				break;				
+		}
+	}
+}
+
+/*
+ * Insert function for buffering index build.
+ */
+static void
+gistBufferingBuildInsert(Relation index, IndexTuple itup, 
+						 GISTBuildState *buildstate)
+{
+	GISTBuildBuffers *gfbb = buildstate->giststate.gfbb;
+	GISTInsertState insertstate;
+	
+	memset(&insertstate, 0, sizeof(GISTInsertState));
+	insertstate.freespace = RelationGetTargetPageFreeSpace(index, 
+													GIST_DEFAULT_FILLFACTOR);
+	insertstate.r = index;
+	
+	/* We are ready for index tuple processing */
+	processItup(&buildstate->giststate, &insertstate, gfbb, itup, NULL);
+	
+	/* Process buffer emptying stack if any */
+	processEmptyingStack(&buildstate->giststate, &insertstate);
+}
+
+/*
+ * Per-tuple callback from IndexBuildHeapScan
+ */
+static void
+gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(&buildstate->giststate, index,
+						 values, isnull, true /* size is currently bogus */ );
+	itup->t_tid = htup->t_self;
+	
+	if (buildstate->bufferingMode == 'y')
+	{
+		/* We've decided to use buffering. So let's use buffering insert. */
+		gistBufferingBuildInsert(index, itup, buildstate);
+	}
+	else
+	{
+		BlockNumber		leafBlocknum;
+		LeafPageInfo   *leafInfo;
+		bool			found;
+		
+		/* We didn't decide to use buffering yet or aren't going to use it at
+		 * all. Since we already have the index relation locked, we call
+		 * gistdoinsert directly.  Normal access method calls dispatch through
+		 * gistinsert, which locks the relation for write.	This is the right
+		 * thing to do if you're inserting single tups, but not when you're
+		 * initializing the whole index at once.
+		 *
+		 * In this path we respect the fillfactor setting, whereas insertions
+		 * after initial build do not.
+		 */
+		leafBlocknum = gistdoinsert(index, itup,
+				 RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
+					 &buildstate->giststate);
+		
+		if (buildstate->bufferingMode == 'a')
+		{
+			/* Record the access to this leaf page */
+			leafInfo = (LeafPageInfo *) hash_search(buildstate->leafPagesTab,
+												 (const void *) &leafBlocknum,
+												 HASH_ENTER, 
+												 &found);
+			leafInfo->lastTupleNumber = buildstate->indtuples;
+		}
+		
+	}
+	
+	buildstate->indtuples += 1;
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->tmpCtx);
+	
+	if (buildstate->bufferingMode == 'a' &&
+		buildstate->indtuples % LEAF_PAGES_STATS_STEP == 0)
+	{
+		HASH_SEQ_STATUS scan_status;
+		LeafPageInfo   *leafInfo;
+		int				removedItemsCount = 0;
+		BlockNumber		indexSize;
+			
+		/* 
+		 * Count the leaf pages which were accessed by an earlier chunk of
+		 * index tuples but weren't accessed by the last chunk.
+		 */
+		hash_seq_init(&scan_status, buildstate->leafPagesTab);		
+		while((leafInfo = (LeafPageInfo *)hash_seq_search(&scan_status)) != NULL)
+		{
+			if (leafInfo->lastTupleNumber < buildstate->indtuples - LEAF_PAGES_STATS_STEP)
+			{
+				if (hash_search(buildstate->leafPagesTab, 
+								(const void *) &leafInfo->blockNumber,
+								HASH_REMOVE, NULL) == NULL)
+					elog(ERROR, "hash table corrupted");
+				removedItemsCount++;
+			}
+		}
+		buildstate->nonHitLeafPagesStats[buildstate->nonHitLeafPagesStatsIndex] =
+			removedItemsCount;
+		buildstate->nonHitLeafPagesStatsIndex = 
+			(buildstate->nonHitLeafPagesStatsIndex + 1) % LEAF_PAGES_STATS_COUNT;
+		
+		indexSize = smgrnblocks(index->rd_smgr, MAIN_FORKNUM);
+
+		/*
+		 * Check if we are going to switch to buffering build.
+		 */
+		if (buildstate->bufferingMode == 'a' && 
+			effective_cache_size < indexSize && 
+			indexSize >= 2 * LEAF_PAGES_STATS_STEP)
+		{
+			int i, nonHitLeafPages = 0;
+			double lambda, factor;
+			for (i = 0; i < LEAF_PAGES_STATS_COUNT; i++)
+			{
+				nonHitLeafPages += buildstate->nonHitLeafPagesStats[i];
+			}
+			/* Calculate parameter of exponential distribution. */
+			lambda = (-1.0 / (double)LEAF_PAGES_STATS_STEP) * 
+					 log((double)nonHitLeafPages / (double)(LEAF_PAGES_STATS_STEP * LEAF_PAGES_STATS_COUNT));
+			
+			/* Estimate the fraction of index tuples which can be processed without I/O */
+			factor = (1 - exp(- (double)effective_cache_size * lambda)) / 
+					 (1 - exp(- (double)indexSize * lambda));
+
+			/* If the estimated fraction is below the threshold, switch to buffering build */
+			if (factor < 0.95)
+			{
+				if (initBuffering(buildstate, index))
+				{
+					/*
+					 * Buffering build is successfully initialized. Now we can
+					 * set appropriate flag.
+					 */
+					buildstate->bufferingMode = 'y';
+					elog(INFO, "switching to buffered mode");
+				}
+				else
+				{
+					/*
+					 * Failed to switch to buffering build because the memory
+					 * settings are too low. Mark that we aren't going to try
+					 * again.
+					 */
+					buildstate->bufferingMode = 'n';
+				}
+			}
+		}	
+	}
+}
+
+/*
+ * Initial calculations for GiST buffering build.
+ */
+static bool
+initBuffering(GISTBuildState *buildstate, Relation index)
+{
+	int			pagesPerBuffer = -1;
+	bool		neighborRelocation = true;
+	Size		pageFreeSpace;
+	Size		itupMinSize;
+	int			i, maxIndexTuplesCount;
+	int			effectiveMemory;
+	int			levelStep = 0;
+	GISTBuildBuffers *gfbb;
+	
+	/* try to get user-defined options */
+	if (index->rd_options)
+	{
+		GiSTOptions *options = (GiSTOptions *)index->rd_options;
+		levelStep = options->levelStep;
+		pagesPerBuffer = options->bufferSize;
+		neighborRelocation = options->neighborRelocation;
+	}
+
+	/* calculate the maximum number of index tuples that fit on a page */
+	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - 
+		sizeof(GISTPageOpaqueData) - sizeof(ItemIdData);
+	itupMinSize = (Size)MAXALIGN(sizeof(IndexTupleData));
+	for (i = 0; i < index->rd_att->natts; i++)
+	{
+		if (index->rd_att->attrs[i]->attlen < 0)
+			itupMinSize += VARHDRSZ;
+		else
+			itupMinSize += index->rd_att->attrs[i]->attlen;
+	}
+	maxIndexTuplesCount = pageFreeSpace / itupMinSize;
+
+	/* calculate level step if it isn't specified by user */		
+	effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
+						  effective_cache_size);
+	if (levelStep <= 0)
+	{
+		levelStep = (int) (log((double)effectiveMemory / 4.0) /
+			log((double)maxIndexTuplesCount));
+		if (levelStep < 0) 
+			levelStep = 0;
+	}
+
+	/* calculate buffer size if it isn't specified by user */
+	if (pagesPerBuffer <= 0)
+	{
+		pagesPerBuffer = 2;
+		for (i = 0; i < levelStep; i++)
+		{
+			pagesPerBuffer *= maxIndexTuplesCount;
+		}
+	}
+
+	if (levelStep > 0)
+	{
+		/* Enough memory for at least levelStep == 1. */
+		gfbb = palloc(sizeof(GISTBuildBuffers));
+		gfbb->pagesPerBuffer = pagesPerBuffer;
+		gfbb->levelStep = levelStep;
+		gfbb->neighborRelocation = neighborRelocation;
+		initGiSTBuildBuffers(gfbb);
+		buildstate->giststate.gfbb = gfbb;
+		elog(INFO, "Level step = %d, pagesPerBuffer = %d", levelStep, 
+			 pagesPerBuffer);
+		return true;
+	}
+	else
+	{
+		/* Not enough memory for buffering build. */
+		return false;
+	}
+}
+
+/*
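+ * Routine to build an index.  In regular mode this basically calls insert
+ * over and over.  In buffering mode tuples are pushed down through node
+ * buffers instead; the mode is "on", "off" or "auto" (the default), and in
+ * "auto" mode the build may switch to buffering once it is under way.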
+ */
+Datum
+gistbuild(PG_FUNCTION_ARGS)
+{
+	Relation	heap = (Relation) PG_GETARG_POINTER(0);
+	Relation	index = (Relation) PG_GETARG_POINTER(1);
+	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+	IndexBuildResult *result;
+	double		reltuples;
+	GISTBuildState buildstate;
+	Buffer		buffer;
+	Page		page;
+	HASHCTL		hashCtl;
+	int			i;
+	MemoryContext oldcxt = CurrentMemoryContext;
+	
+	if (index->rd_options)
+	{
+		/* Get buffering mode from the options string */
+		GiSTOptions *options = (GiSTOptions *)index->rd_options;
+		char *bufferingMode = (char *)options + options->bufferingModeOffset;
+		if (strcmp(bufferingMode, "on") == 0)
+			buildstate.bufferingMode = 'y';
+		else if (strcmp(bufferingMode, "off") == 0)
+			buildstate.bufferingMode = 'n';
+		else	/* "auto"; also avoids leaving the mode uninitialized */
+			buildstate.bufferingMode = 'a';
+	}
+	else
+	{
+		/* Automatic buffering mode switching by default */
+		buildstate.bufferingMode = 'a';
+	}
+	
+	if (buildstate.bufferingMode == 'a')
+	{
+		/* Init the hash table for tracking leaf page accesses */
+		hashCtl.keysize = sizeof(BlockNumber);
+		hashCtl.entrysize = sizeof(LeafPageInfo);
+		hashCtl.hcxt = CurrentMemoryContext;	
+		hashCtl.hash = tag_hash;
+		hashCtl.match = memcmp;
+		buildstate.leafPagesTab = hash_create(
+					"leafpagestab", 
+					2 * LEAF_PAGES_STATS_STEP, 
+					&hashCtl,
+					HASH_ELEM | HASH_CONTEXT | HASH_FUNCTION | HASH_COMPARE);
+	}
+	else
+		buildstate.leafPagesTab = NULL;
+
+	/*
+	 * We expect to be called exactly once for any index relation. If that's
+	 * not the case, big trouble's what we have.
+	 */
+	if (RelationGetNumberOfBlocks(index) != 0)
+		elog(ERROR, "index \"%s\" already contains data",
+			 RelationGetRelationName(index));
+
+	/* no locking is needed */
+	initGISTstate(&buildstate.giststate, index);
+	if (buildstate.bufferingMode == 'y')
+	{
+		if (!initBuffering(&buildstate, index))
+			buildstate.bufferingMode = 'n';
+	}
+
+	/* initialize the root page */
+	buffer = gistNewBuffer(index);
+	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
+	page = BufferGetPage(buffer);
+
+	START_CRIT_SECTION();
+
+	GISTInitBuffer(buffer, F_LEAF);
+
+	MarkBufferDirty(buffer);
+
+	if (RelationNeedsWAL(index))
+	{
+		XLogRecPtr	recptr;
+		XLogRecData rdata;
+
+		rdata.data = (char *) &(index->rd_node);
+		rdata.len = sizeof(RelFileNode);
+		rdata.buffer = InvalidBuffer;
+		rdata.next = NULL;
+
+		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
+		PageSetLSN(page, recptr);
+		PageSetTLI(page, ThisTimeLineID);
+	}
+	else
+		PageSetLSN(page, GetXLogRecPtrForTemp());
+
+	UnlockReleaseBuffer(buffer);
+
+	END_CRIT_SECTION();
+
+	/* build the index */
+	buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
+	buildstate.indtuples = 0;
+	buildstate.nonHitLeafPagesStatsIndex = 0;
+	for (i = 0; i < LEAF_PAGES_STATS_COUNT; i++)
+		buildstate.nonHitLeafPagesStats[i] = 0;
+
+	/*
+	 * create a temporary memory context that is reset once for each tuple
+	 * inserted into the index
+	 */
+	buildstate.tmpCtx = createTempGistContext();
+	
+	/* 
+	 * Do the heap scan.
+	 */
+	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
+							   gistBuildCallback, (void *) &buildstate);
+	
+	/*
+	 * In a buffering build, do the final emptying of the node buffers.
+	 */
+	if (buildstate.bufferingMode == 'y')
+	{
+		int i;
+		GISTInsertState insertstate;
+		MemoryContext oldCtx;
+		GISTBuildBuffers *gfbb = buildstate.giststate.gfbb;
+		NodeBuffer **buffers;
+		NodeBuffer *buf;
+		HASH_SEQ_STATUS scan_status;
+		int		nbuffers;
+
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		
+		memset(&insertstate, 0, sizeof(GISTInsertState));
+		insertstate.freespace = RelationGetTargetPageFreeSpace(index, 
+													GIST_DEFAULT_FILLFACTOR);
+		insertstate.r = index;
+
+		for (;;)
+		{
+			/* Allocate an array large enough to hold all the buffers */
+			buffers = MemoryContextAlloc(gfbb->context,
+										 hash_get_num_entries(gfbb->nodeBuffersTab) * sizeof(NodeBuffer *));
+			nbuffers = 0;
+			/* Collect the non-empty buffers into the array */
+			hash_seq_init(&scan_status, gfbb->nodeBuffersTab);
+			while ((buf = hash_seq_search(&scan_status)) != NULL)
+			{
+				if (buf->tuplesCount > 0)
+					buffers[nbuffers++] = buf;
+			}
+			if (nbuffers == 0)
+				break;
+
+			/*
+			 * Sort the buffers by level and empty them, from top to bottom
+			 */
+			qsort(buffers, nbuffers, sizeof(NodeBuffer *), bufferlevel_cmp);
+			for (i = 0; i < nbuffers; i++)
+			{
+				MemoryContext oldcxt;
+
+				oldcxt = MemoryContextSwitchTo(gfbb->context);
+				gfbb->bufferEmptyingStack = lcons(buffers[i], gfbb->bufferEmptyingStack);
+				MemoryContextSwitchTo(oldcxt);
+
+				processEmptyingStack(&buildstate.giststate, &insertstate);
+			}
+			/*
+			 * Emptying these buffers might've created new buffers, so iterate
+			 * until we're fully done
+			 */
+		}
+		for (i = 0; i < nbuffers; i++)
+		{
+			Assert(buffers[i]->tuplesCount == 0);
+		}
+	}
+	
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(buildstate.tmpCtx);
+
+	freeGISTstate(&buildstate.giststate);
+
+	/*
+	 * Return statistics
+	 */
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+	result->heap_tuples = reltuples;
+	result->index_tuples = (double)buildstate.indtuples;
+
+	PG_RETURN_POINTER(result);
+}
+
+/*
+ * qsort comparator for sorting NodeBuffers by level.
+ */
+static int
+bufferlevel_cmp(const void *a, const void *b)
+{
+	NodeBuffer *abuf = *((NodeBuffer **) a);
+	NodeBuffer *bbuf = *((NodeBuffer **) b);
+
+	return (bbuf->level - abuf->level);
+}
+
+
+#ifdef GIST_DEBUG
+static void
+gist_dumptree(Relation r, int level, BlockNumber blk, OffsetNumber coff, BOX *downlink, StringInfo out, int maxlevel) {
+	Buffer		buffer;
+	Page		page;
+	IndexTuple	which;
+	ItemId		iid;
+	OffsetNumber i,
+				maxoff;
+	BlockNumber cblk;
+	char	   *pred;
+
+	pred = (char *) palloc(sizeof(char) * level * 4 + 1);
+	MemSet(pred, ' ', level*4);
+	pred[level*4] = '\0';
+
+	buffer = ReadBuffer(r, blk);
+	page = (Page) BufferGetPage(buffer);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	appendStringInfo(out, "%s%d(l:%d) blk: %d numTuple: %d free: %db(%.2f%%) rightlink:%u (%s) (%.5g, %.5g),(%.5g, %.5g)\n", 
+		pred,
+		coff, 
+		level, 
+		(int) blk,
+		(int) maxoff, 
+		(int) PageGetFreeSpace(page),	/* Size is wider than int; match %d */
+		100.0*(((float)BLCKSZ)-(float)PageGetFreeSpace(page))/((float)BLCKSZ),
+		GistPageGetOpaque(page)->rightlink,
+					 ( GistPageGetOpaque(page)->rightlink == InvalidBlockNumber ) ? "InvalidBlockNumber" : "OK",
+					 downlink ? downlink->low.x : 0,
+					 downlink ? downlink->low.y : 0,
+					 downlink ? downlink->high.x : 0,
+					 downlink ? downlink->high.y :0 );
+
+	if (maxlevel<0 || level<maxlevel)
+	{
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			bool isnull;
+			BOX *box;
+
+			iid = PageGetItemId(page, i);
+			which = (IndexTuple) PageGetItem(page, iid);
+
+			box = (BOX *)index_getattr(which, 1, RelationGetDescr(r), &isnull);
+
+			if (!GistPageIsLeaf(page))
+			{
+				cblk = ItemPointerGetBlockNumber(&(which->t_tid));
+				gist_dumptree(r, level + 1, cblk, i, box, out, maxlevel);
+			}
+			else
+			{
+				bool invalid = false;
+				if (downlink)
+				{
+					if (box->high.x > downlink->high.x)
+						invalid = true;
+					if (box->high.y > downlink->high.y)
+						invalid = true;
+					if (box->high.x < downlink->low.x)
+						invalid = true;
+					if (box->high.y < downlink->low.y)
+						invalid = true;
+				}
+
+				if (invalid)
+					appendStringInfo(out, "%s    leaf item %d points to %u/%u: (%.5g, %.5g)%s\n",
+									 pred, i,
+									 ItemPointerGetBlockNumber(&(which->t_tid)),
+									 ItemPointerGetOffsetNumber(&(which->t_tid)),
+									 box->high.x, box->high.y, invalid ? " INVALID" : "");
+			}
+		}
+	}
+	ReleaseBuffer(buffer);
+	pfree(pred);
+}
+
+static int
+gist_counttups(Relation r, BlockNumber blk)
+{
+	Buffer		buffer;
+	Page		page;
+	IndexTuple	which;
+	ItemId		iid;
+	OffsetNumber i,
+				maxoff;
+	BlockNumber cblk;
+	int			ntuples;
+
+	buffer = ReadBuffer(r, blk);
+	page = (Page) BufferGetPage(buffer);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	ntuples = 0;
+	for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+	{
+		iid = PageGetItemId(page, i);
+		which = (IndexTuple) PageGetItem(page, iid);
+
+		if (!GistPageIsLeaf(page))
+		{
+			cblk = ItemPointerGetBlockNumber(&(which->t_tid));
+			ntuples += gist_counttups(r, cblk);
+		}
+		else
+			ntuples++;
+	}
+	ReleaseBuffer(buffer);
+
+	return ntuples;
+}
+
+static int
+gist_count_tuples_in_buffers(GISTBuildBuffers *gfbb)
+{
+	HASH_SEQ_STATUS scan_status;
+	NodeBuffer *buf;
+	int ntuples = 0;
+
+	hash_seq_init(&scan_status, gfbb->nodeBuffersTab);
+	while ((buf = hash_seq_search(&scan_status)) != NULL)
+		ntuples += buf->tuplesCount;
+	return ntuples;
+}
+#endif
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
new file mode 100644
index 0000000..02c91e7
--- /dev/null
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -0,0 +1,600 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuildbuffers.c
+ *	  buffers management functions for GiST buffering build algorithm.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuildbuffers.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "storage/buffile.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static NodeBufferPage *allocateNewPageBuffer(GISTBuildBuffers *gfbb);
+static void placeItupToPage(NodeBufferPage *pageBuffer, IndexTuple item);
+static void getItupFromPage(NodeBufferPage *pageBuffer, IndexTuple *item);
+static int freeBlocks_cmp(const void *a, const void *b);
+static long getFreeBlock(GISTBuildBuffers *gfbb);
+static void releaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+
+/*
+ * Initialize GiST buffering build structure.
+ */
+void
+initGiSTBuildBuffers(GISTBuildBuffers * gfbb)
+{
+	HASHCTL		hashCtl;
+
+	/*
+	 * Create a temporary file and initialize the data structures for free
+	 * page management.
+	 */
+	gfbb->pfile = BufFileCreateTemp(true);
+	gfbb->nFileBlocks = 0;
+	gfbb->nFreeBlocks = 0;
+	gfbb->blocksSorted = false;
+	gfbb->freeBlocksLen = 32;
+	gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
+	
+	/*
+	 * Current memory context will be used for all in-memory data structures
+	 * of buffers.
+	 */
+	gfbb->context = CurrentMemoryContext;
+	
+	/*
+	 * nodeBuffersTab hash maps index block numbers to their node buffers.
+	 */
+	hashCtl.keysize = sizeof(BlockNumber);
+	hashCtl.entrysize = sizeof(NodeBuffer);
+	hashCtl.hcxt = CurrentMemoryContext;	
+	hashCtl.hash = tag_hash;
+	hashCtl.match = memcmp;
+	gfbb->nodeBuffersTab = hash_create(
+								"gistbuildbuffers", 
+								1024, 
+								&hashCtl,
+								HASH_ELEM | HASH_CONTEXT | HASH_FUNCTION | HASH_COMPARE);
+	
+	/*
+	 * Stack of node buffers which are planned for emptying.
+	 */
+	gfbb->bufferEmptyingStack = NIL;
+	
+	gfbb->currentEmptyingBufferBlockNumber = InvalidBlockNumber;
+	gfbb->currentEmptyingBufferSplit = false;
+
+	gfbb->rootitem = (GISTLoadedPartItem *) MemoryContextAlloc(gfbb->context,
+															   sizeof(GISTLoadedPartItem));
+	gfbb->rootitem->parent = NULL;
+	gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+	gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+	gfbb->rootitem->level = 0;
+}
+
+/*
+ * Return the NodeBuffer structure for the given block number. If the createNew
+ * flag is set, a new NodeBuffer structure is created when none exists yet.
+ */
+NodeBuffer *
+getNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+			  BlockNumber nodeBlocknum,
+			  OffsetNumber downlinkoffnum,
+			  GISTLoadedPartItem *parent,
+			  bool createNew)
+{
+	NodeBuffer *nodeBuffer;
+	bool found;
+
+	/* Find nodeBuffer in hash table */
+	nodeBuffer = (NodeBuffer *) hash_search(gfbb->nodeBuffersTab,
+											 (const void *) &nodeBlocknum,
+											 createNew ? HASH_ENTER : HASH_FIND, 
+											 &found);
+	if (!found)
+	{
+		GISTLoadedPartItem *path;
+
+		/* 
+		 * Node buffer wasn't found. Create new if required.
+		 */
+		if (!createNew)
+			return NULL;
+
+		if (nodeBlocknum != GIST_ROOT_BLKNO)
+		{
+			path = (GISTLoadedPartItem *) MemoryContextAlloc(gfbb->context,
+															 sizeof(GISTLoadedPartItem));
+			path->parent = parent;
+			path->blkno = nodeBlocknum;
+			path->downlinkoffnum = downlinkoffnum;
+			path->level = parent->level - 1;
+			Assert(path->level > 0);
+		}
+		else
+			path = gfbb->rootitem;
+		
+		/*
+		 * New node buffer. Fill data structure with default values.
+		 */		
+		nodeBuffer->pageBuffer = NULL;
+		nodeBuffer->blocksCount = 0;
+		nodeBuffer->tuplesCount = 0;
+		nodeBuffer->level = path->level;
+		nodeBuffer->path = path;
+	}
+	else
+	{
+		Assert(nodeBuffer->path->parent == parent);
+	}
+
+	return nodeBuffer;
+}
+
+/*
+ * Allocate memory for buffer page.
+ */
+static NodeBufferPage *
+allocateNewPageBuffer(GISTBuildBuffers *gfbb)
+{
+	NodeBufferPage *pageBuffer;
+	/* Allocate memory for page in appropriate context. */
+	pageBuffer = (NodeBufferPage *) MemoryContextAlloc(gfbb->context, BLCKSZ);	
+	/* Set page free space; must match the value PAGE_IS_EMPTY() tests for */
+	PAGE_FREE_SPACE(pageBuffer) = BLCKSZ - offsetof(NodeBufferPage, tupledata);
+	return pageBuffer;
+}
+
+/*
+ * Add item to buffer page.
+ */
+static void
+placeItupToPage(NodeBufferPage *pageBuffer, IndexTuple itup)
+{
+	/* Get pointer to page free space start */
+	char *ptr = (char *)pageBuffer + PAGE_FREE_SPACE(pageBuffer)
+								   - MAXALIGN(IndexTupleSize(itup));
+	/* There should be enough space */
+	Assert(PAGE_FREE_SPACE(pageBuffer) >= MAXALIGN(IndexTupleSize(itup)));
+	/* Copy index tuple to free space */
+	memcpy(ptr, itup, IndexTupleSize(itup));
+	/* Reduce free space value of page */
+	PAGE_FREE_SPACE(pageBuffer) -= MAXALIGN(IndexTupleSize(itup));
+}
+
+/*
+ * Get last item from buffer page and remove it from page.
+ */
+static void
+getItupFromPage(NodeBufferPage *pageBuffer, IndexTuple *itup)
+{
+	/* Get pointer to last index tuple */
+	IndexTuple ptr = (IndexTuple)((char *)pageBuffer + 
+								  PAGE_FREE_SPACE(pageBuffer));	
+	/* Page shouldn't be empty */
+	Assert(!PAGE_IS_EMPTY(pageBuffer));
+	/* Allocate memory for returned index tuple copy */
+	*itup = (IndexTuple)palloc(IndexTupleSize(ptr));
+	/* Copy data */
+	memcpy(*itup, ptr, IndexTupleSize(ptr));
+	/* Increase free space value of page */
+	PAGE_FREE_SPACE(pageBuffer) += MAXALIGN(IndexTupleSize(*itup));	
+}
+
+/*
+ * Push new index tuple to node buffer.
+ */
+void
+pushItupToNodeBuffer(GISTBuildBuffers *gfbb, NodeBuffer *nodeBuffer, 
+					 IndexTuple itup)
+{
+	/*
+	 * Most memory operations here use the persistent buffering build
+	 * context, so switch to it.
+	 */
+	MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+	/*
+	 * If the node buffer holds no data yet, it has no in-memory page.
+	 * Allocate a page buffer in that case; it becomes the first (and so far
+	 * only) block of the node buffer.
+	 */
+	if (nodeBuffer->blocksCount == 0)
+	{
+		nodeBuffer->pageBuffer = allocateNewPageBuffer(gfbb);
+		nodeBuffer->pageBuffer->prev = InvalidBlockNumber;
+		nodeBuffer->blocksCount = 1;
+	}
+
+	/* Check if there is enough space on the last page for the tuple */
+	if (PAGE_NO_SPACE(nodeBuffer->pageBuffer, itup))
+	{
+		/* Swap previous block to disk and allocate new one */
+		BlockNumber blkno;
+
+		nodeBuffer->blocksCount++;
+
+		blkno = getFreeBlock(gfbb);
+		BufFileSeekBlock(gfbb->pfile, blkno);
+		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+		pfree(nodeBuffer->pageBuffer);	/* page is on disk now; don't leak it */
+		nodeBuffer->pageBuffer = allocateNewPageBuffer(gfbb);
+		nodeBuffer->pageBuffer->prev = blkno;
+	}
+
+	placeItupToPage(nodeBuffer->pageBuffer, itup);
+
+	/* Increase tuples counter of node buffer */
+	nodeBuffer->tuplesCount++;
+
+	/* Restore memory context */
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * Removes one index tuple from node buffer. Returns true if success and false
+ * if node buffer is empty.
+ */
+bool
+popItupFromNodeBuffer(GISTBuildBuffers *gfbb, NodeBuffer *nodeBuffer, 
+					  IndexTuple *itup)
+{
+	/* If node buffer is empty then return false. */
+	if (nodeBuffer->blocksCount <= 0)
+		return false;
+	
+	/* Get index tuple from the last non-empty page. */
+	getItupFromPage(nodeBuffer->pageBuffer, itup);
+	
+	/* Check whether the page the index tuple was taken from is now empty */
+	if (PAGE_IS_EMPTY(nodeBuffer->pageBuffer))
+	{
+		BlockNumber prevblkno;
+		/* 
+		 * If it's empty then we need to release buffer file block and free
+		 * page buffer.
+		 */
+		nodeBuffer->blocksCount--;
+
+		/* If there are more pages, fetch the previous one */
+		prevblkno = nodeBuffer->pageBuffer->prev;
+		if (prevblkno != InvalidBlockNumber)
+		{
+			Assert(nodeBuffer->blocksCount > 0);
+			BufFileSeekBlock(gfbb->pfile, prevblkno);
+			BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+			releaseBlock(gfbb, prevblkno);
+		}
+		else
+		{
+			Assert(nodeBuffer->blocksCount == 0);
+			pfree(nodeBuffer->pageBuffer);
+			nodeBuffer->pageBuffer = NULL;
+		}
+	}
+	
+	/* Decrease node buffer tuples counter. */
+	nodeBuffer->tuplesCount--;
+	return true;
+}
+
+/*
+ * qsort comparator for sorting freeBlocks[] into decreasing order.
+ */
+static int
+freeBlocks_cmp(const void *a, const void *b)
+{
+	long		ablk = *((const long *) a);
+	long		bblk = *((const long *) b);
+
+	/* can't just subtract because long might be wider than int */
+	if (ablk < bblk)
+		return 1;
+	if (ablk > bblk)
+		return -1;
+	return 0;
+}
+
+/*
+ * Select a currently unused block for writing to.
+ *
+ * NB: should only be called when writer is ready to write immediately,
+ * to ensure that first write pass is sequential.
+ */
+static long
+getFreeBlock(GISTBuildBuffers *gfbb)
+{
+	/*
+	 * If there are multiple free blocks, we select the one appearing last in
+	 * freeBlocks[] (after sorting the array if needed).  If there are none,
+	 * assign the next block at the end of the file.
+	 */
+	if (gfbb->nFreeBlocks > 0)
+	{
+		if (!gfbb->blocksSorted)
+		{
+			qsort((void *) gfbb->freeBlocks, gfbb->nFreeBlocks,
+				  sizeof(long), freeBlocks_cmp);
+			gfbb->blocksSorted = true;
+		}
+		return gfbb->freeBlocks[--gfbb->nFreeBlocks];
+	}
+	else
+		return gfbb->nFileBlocks++;
+}
+
+/*
+ * Return a block# to the freelist.
+ */
+static void
+releaseBlock(GISTBuildBuffers *gfbb, long blocknum)
+{
+	int			ndx;
+
+	/*
+	 * Enlarge freeBlocks array if full.
+	 */
+	if (gfbb->nFreeBlocks >= gfbb->freeBlocksLen)
+	{
+		gfbb->freeBlocksLen *= 2;
+		gfbb->freeBlocks = (long *) repalloc(gfbb->freeBlocks,
+										  gfbb->freeBlocksLen * sizeof(long));
+	}
+
+	/*
+	 * Add blocknum to array, and mark the array unsorted if it's no longer in
+	 * decreasing order.
+	 */
+	ndx = gfbb->nFreeBlocks++;
+	gfbb->freeBlocks[ndx] = blocknum;
+	if (ndx > 0 && gfbb->freeBlocks[ndx - 1] < blocknum)
+		gfbb->blocksSorted = false;
+}
+
+/*
+ * Free buffering build data structure.
+ */
+void 
+freeGiSTBuildBuffers(GISTBuildBuffers *gfbb)
+{
+	/* Close buffers file. */
+	BufFileClose(gfbb->pfile);
+	/* Everything else is freed when the memory context is released */
+}
+
+/*
+ * Information about a node buffer that receives index tuples relocated from
+ * the node buffer of a split page.
+ */
+typedef struct
+{
+	GISTENTRY			entry[INDEX_MAX_KEYS];
+	bool				isnull[INDEX_MAX_KEYS];
+	SplitedPageLayout  *dist;
+	NodeBuffer		   *nodeBuffer;
+	OffsetNumber		offnum;
+} RelocationBufferInfo;
+
+/*
+ * Maintain data structures on page split.
+ */
+void
+relocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+							Relation r, GISTLoadedPartItem *path,
+							Buffer buffer, SplitedPageLayout *dist)
+{
+	RelocationBufferInfo *relocationBuffersInfos;
+	bool found;
+	NodeBuffer *nodeBuffer;
+	BlockNumber blocknum;	
+	IndexTuple itup;
+	int splitPagesCount = 0, i;
+	GISTENTRY	entry[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	SplitedPageLayout *ptr;
+	int level = path->level;
+	NodeBuffer nodebuf;
+	SplitedPageLayout *last = NULL;
+
+	blocknum = BufferGetBlockNumber(buffer);
+
+	/*
+	 * If the level of the split page doesn't have buffers, we are done.
+	 * Otherwise we also need to relocate the index tuples in the buffer of
+	 * the split page.
+	 */
+	if (!LEVEL_HAS_BUFFERS(level, gfbb))
+		return;
+
+	/*
+	 * Get a pointer to the node buffer of the split page; it is removed from the hash below.
+	 */	
+	nodeBuffer = hash_search(gfbb->nodeBuffersTab, &blocknum,
+							 HASH_FIND, &found);
+	if (!found)
+	{
+		/*
+		 * The node buffer must already exist at this point, created either
+		 * by index tuple insertion or by an earlier page split.
+		 */
+		elog(ERROR, "node buffer of page being split (%u) does not exist", blocknum);
+	}
+
+	/*
+	 * We are about to modify the node buffers hash, and it is unsafe to use
+	 * a hash entry after it has been removed. So save a copy of it first.
+	 */		
+	memcpy(&nodebuf, nodeBuffer, sizeof(NodeBuffer));
+	nodeBuffer = &nodebuf;
+
+	hash_search(gfbb->nodeBuffersTab, &blocknum, HASH_REMOVE, &found);
+	Assert(found);
+
+	/*
+	 * Count the pages produced by the split and remember the data structure
+	 * of the last one.
+	 */
+	for (ptr = dist; ptr; ptr = ptr->next)
+	{
+		last = ptr;
+		splitPagesCount++;
+	}
+
+	/* Allocate memory for information about relocation buffers. */
+	relocationBuffersInfos = (RelocationBufferInfo *)palloc(
+		sizeof(RelocationBufferInfo) * splitPagesCount);
+
+	/*
+	 * Fill relocation buffers information for node buffers of pages
+	 * produced by split.
+	 */		
+	i = 0;
+	for (ptr = dist; ptr; ptr = ptr->next)
+	{
+		/* Decompress parent index tuple of node buffer page. */
+		gistDeCompressAtt(giststate, r,
+						  ptr->itup, NULL, (OffsetNumber) 0,
+						  relocationBuffersInfos[i].entry, 
+						  relocationBuffersInfos[i].isnull);
+
+		/* Fill relocation information */
+		relocationBuffersInfos[i].nodeBuffer = 
+			getNodeBuffer(gfbb, giststate, ptr->block.blkno,
+						  path->downlinkoffnum, path->parent, true);
+
+		/* Fill node buffer structure */
+		relocationBuffersInfos[i].dist = ptr;
+
+		i++;
+	}
+
+	/*
+	 * Loop of index tuples relocation.
+	 */		
+	while (popItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
+	{
+		float sum_grow,	which_grow[INDEX_MAX_KEYS];
+		int i, which;
+		IndexTuple newtup;
+		bool wasOverflow;
+
+		/* Choose node buffer for index tuple insert. */
+
+		gistDeCompressAtt(giststate, r,
+						  itup, NULL, (OffsetNumber) 0,
+						  entry, isnull);
+
+		which = -1;
+		*which_grow = -1.0f;
+		sum_grow = 1.0f;
+
+		for (i = 0; i < splitPagesCount && sum_grow; i++)
+		{
+			int j;
+			RelocationBufferInfo *splitPageInfo = & relocationBuffersInfos[i];
+
+			sum_grow = 0.0f;
+			for (j = 0; j < r->rd_att->natts; j++)
+			{
+				float		usize;
+
+				usize = gistpenalty(giststate, j, 
+									&splitPageInfo->entry[j], 
+									splitPageInfo->isnull[j],
+									&entry[j], isnull[j]);
+
+				if (which_grow[j] < 0 || usize < which_grow[j])
+				{
+					which = i;
+					which_grow[j] = usize;
+					if (j < r->rd_att->natts - 1 && i == 0)
+						which_grow[j + 1] = -1;
+					sum_grow += which_grow[j];
+				}
+				else if (which_grow[j] == usize)
+					sum_grow += usize;
+				else
+				{
+					sum_grow = 1;
+					break;
+				}
+			}
+		}
+			
+		wasOverflow = BUFFER_IS_OVERLOW(
+			relocationBuffersInfos[which].nodeBuffer, gfbb);
+
+		/* push item to selected node buffer */
+		pushItupToNodeBuffer(gfbb, relocationBuffersInfos[which].nodeBuffer, 
+							 itup);
+			
+		/* 
+		 * If this push just made the node buffer overflow, add it to the
+		 * emptying stack.
+		 */
+		if (BUFFER_IS_OVERLOW(relocationBuffersInfos[which].nodeBuffer, gfbb) &&
+			!wasOverflow)
+		{
+			MemoryContext oldcxt = CurrentMemoryContext;
+			MemoryContextSwitchTo(gfbb->context);
+			gfbb->bufferEmptyingStack = lcons(relocationBuffersInfos[which].nodeBuffer, gfbb->bufferEmptyingStack);
+			MemoryContextSwitchTo(oldcxt);
+		}
+
+		/* adjust tuple of parent page */
+		newtup = gistgetadjusted(r, relocationBuffersInfos[which].dist->itup, 
+								 itup, giststate);
+		if (newtup)
+		{
+			/*
+			 * Parent page index tuple expands. We need to update old
+			 * index tuple with the new one.
+			 */
+			gistDeCompressAtt(giststate, r,
+							  newtup, NULL, (OffsetNumber) 0,
+							  relocationBuffersInfos[which].entry, 
+							  relocationBuffersInfos[which].isnull);
+
+			relocationBuffersInfos[which].dist->itup = newtup;
+		}
+	}
+
+	if (blocknum == gfbb->currentEmptyingBufferBlockNumber)
+		gfbb->currentEmptyingBufferSplit = true;
+
+	pfree(relocationBuffersInfos);
+}
+
+/*
+ * Return size of node buffer occupied by stored index tuples.
+ */
+int
+getNodeBufferBusySize(NodeBuffer *nodeBuffer)
+{
+	int size;
+	
+	/* No occupied buffer file blocks means that node buffer is empty. */
+	if (nodeBuffer->blocksCount == 0)
+		return 0;
+	
+	/* We assume only the last page to be not fully filled. */
+	size = (BLCKSZ - MAXALIGN(sizeof(uint32))) * nodeBuffer->blocksCount;
+	size -= PAGE_FREE_SPACE(nodeBuffer->pageBuffer);
+	return size;	
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 1754a10..ca1a0a3 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -670,13 +670,33 @@ gistoptions(PG_FUNCTION_ARGS)
 {
 	Datum		reloptions = PG_GETARG_DATUM(0);
 	bool		validate = PG_GETARG_BOOL(1);
-	bytea	   *result;
+	relopt_value *options;
+	GiSTOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
+		{"levelstep", RELOPT_TYPE_INT, offsetof(GiSTOptions, levelStep)},
+		{"buffersize", RELOPT_TYPE_INT, offsetof(GiSTOptions, bufferSize)},
+		{"neighborrelocation", RELOPT_TYPE_BOOL, offsetof(GiSTOptions, neighborRelocation)},
+		{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
+	};
 
-	result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		PG_RETURN_NULL();
+
+	rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+
+	PG_RETURN_BYTEA_P(rdopts);
 
-	if (result)
-		PG_RETURN_BYTEA_P(result);
-	PG_RETURN_NULL();
 }
 
 /*
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 9fb20a6..b7f2bf8 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -17,13 +17,56 @@
 #include "access/gist.h"
 #include "access/itup.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "utils/rbtree.h"
+#include "utils/hsearch.h"
+
+/* Does the specified level have buffers? */
+#define LEVEL_HAS_BUFFERS(level,gfbb) ((level) != 0 && (level) % (gfbb)->levelStep == 0)
+/* Is the specified buffer at least half-filled (should it be planned for emptying)? */
+#define BUFFER_IS_OVERLOW(nodeBuffer,gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
 
 /* Buffer lock modes */
 #define GIST_SHARE	BUFFER_LOCK_SHARE
 #define GIST_EXCLUSIVE	BUFFER_LOCK_EXCLUSIVE
 #define GIST_UNLOCK BUFFER_LOCK_UNLOCK
 
+typedef struct
+{
+	BlockNumber prev;
+	uint32		freespace;
+	char		tupledata[1];
+} NodeBufferPage;
+
+/* Returns free space in node buffer page */
+#define PAGE_FREE_SPACE(nbp) ((nbp)->freespace)
+/* Checks if node buffer page is empty */
+#define PAGE_IS_EMPTY(nbp) ((nbp)->freespace == BLCKSZ - offsetof(NodeBufferPage, tupledata))
+/* Checks if node buffer page lacks sufficient space for the index tuple */
+#define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
+										MAXALIGN(IndexTupleSize(itup)))
+
+/* Buffer of tree node data structure */
+typedef struct NodeBuffer
+{
+    /* number of page containing node */
+	BlockNumber	 nodeBlocknum;
+ 
+    /* count of blocks occupied by buffer */
+	int			 blocksCount;
+
+	NodeBufferPage *pageBuffer;
+
+    /* corresponding downlink number in parent page */
+	OffsetNumber downlinkoffnum;
+    /* level of corresponding node */
+	int          level;
+    /* number of tuples in buffer */
+	int          tuplesCount;
+
+	struct GISTLoadedPartItem *path;
+} NodeBuffer;
+
 /*
  * GISTSTATE: information needed for any GiST index operation
  *
@@ -43,6 +86,8 @@ typedef struct GISTSTATE
 
 	/* Collations to pass to the support functions */
 	Oid			supportCollation[INDEX_MAX_KEYS];
+    
+    struct GISTBuildBuffers *gfbb;
 
 	TupleDesc	tupdesc;
 } GISTSTATE;
@@ -225,6 +270,86 @@ typedef struct GISTInsertStack
 	struct GISTInsertStack *parent;
 } GISTInsertStack;
 
+/*
+ * Extended GISTInsertStack for buffering GiST index build. It additionally
+ * holds the level number of the page.
+ */
+typedef struct GISTLoadedPartItem
+{
+	/* current page */
+	BlockNumber blkno;
+
+	/* offset of the downlink in the parent page, that points to this page */
+	OffsetNumber downlinkoffnum;
+
+	/* pointer to parent */
+	struct GISTLoadedPartItem *parent;
+
+	/* level number */
+	int level;
+} GISTLoadedPartItem;
+
+/* List of non-empty node buffers on specific level */
+typedef struct
+{
+	NodeBuffer *first, *last;
+} NodeBuffersOnLevel;
+
+/*
+ * Data structure with general information about build buffers.
+ */
+typedef struct GISTBuildBuffers
+{
+	/* memory context which is persistent during buffering build */
+	MemoryContext	context;    
+	/* underlying files */
+	BufFile		   *pfile;
+	/* # of blocks used in underlying files */
+	long			nFileBlocks;
+	/* is freeBlocks[] currently in order? */
+	bool			blocksSorted;
+	/* resizable array of free blocks */
+	long		   *freeBlocks;
+	/* # of currently free blocks */
+	int				nFreeBlocks;
+	/* current allocated length of freeBlocks[] */
+	int				freeBlocksLen;
+
+	/* hash for buffers by block number*/
+	HTAB		   *nodeBuffersTab;
+
+	/* stack of buffers for emptying */
+	List		   *bufferEmptyingStack;
+	/* number of currently emptying buffer */
+	BlockNumber     currentEmptyingBufferBlockNumber;
+	/* whether currently emptying buffer was split - a signal to stop emptying */
+	bool            currentEmptyingBufferSplit;
+
+	/* whether to use neighbor relocation of buffers when split */
+	bool            neighborRelocation;
+	/* step of levels for buffers location */
+	int             levelStep;
+	/* maximal number of pages occupied by buffer */
+	int             pagesPerBuffer;
+
+	GISTLoadedPartItem *rootitem;
+} GISTBuildBuffers;
+
+/*
+ * Information about sub-tree level planned for load.
+ */
+typedef struct
+{
+	/* pages of sub-tree level */
+	GISTLoadedPartItem **items;
+	/* length of items array */
+	int itemsLen;
+	/* number of pages */
+	int itemsCount;
+	/* level number in whole tree */
+	int level;
+} SubtreeLevelInfo;
+
 typedef struct GistSplitVector
 {
 	GIST_SPLITVEC splitVector;	/* to/from PickSplit method */
@@ -286,6 +411,15 @@ extern Datum gistinsert(PG_FUNCTION_ARGS);
 extern MemoryContext createTempGistContext(void);
 extern void initGISTstate(GISTSTATE *giststate, Relation index);
 extern void freeGISTstate(GISTSTATE *giststate);
+BlockNumber gistdoinsert(Relation r,
+			 IndexTuple itup,
+			 Size freespace,
+			 GISTSTATE *GISTstate);
+bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
+				 GISTSTATE *giststate,
+				 IndexTuple *tuples, int ntup, OffsetNumber oldoffnum,
+				 Buffer leftchild);
+
 
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
@@ -313,6 +447,19 @@ extern Datum gistgetbitmap(PG_FUNCTION_ARGS);
 
 /* gistutil.c */
 
+/*
+ * Storage type for GiST's reloptions
+ */
+typedef struct GiSTOptions
+{
+	int32		vl_len_;       /* varlena header (do not touch directly!) */
+	int			fillfactor;	   /* page fill factor in percent (0..100) */
+	int 		bufferingModeOffset;  /* use buffering build? */
+	bool		neighborRelocation;  /* use neighbor buffer relocation? */
+	int			levelStep;     /* level step in buffering build */
+	int			bufferSize;    /* buffer size in buffering build */
+} GiSTOptions;
+
 #define GiSTPageSize   \
 	( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) )
 
@@ -380,4 +527,14 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
 			   GistSplitVector *v, GistEntryVector *entryvec,
 			   int attno);
 
+/* gistbuildbuffers.c */
+
+void initGiSTBuildBuffers(GISTBuildBuffers * gfbb);
+extern NodeBuffer *getNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate, BlockNumber blkno, OffsetNumber downlinkoffnum, GISTLoadedPartItem *parent, bool createNew);
+void pushItupToNodeBuffer(GISTBuildBuffers *gfbb, NodeBuffer *nodeBuffer, IndexTuple item);
+bool popItupFromNodeBuffer(GISTBuildBuffers *gfbb, NodeBuffer *nodeBuffer, IndexTuple *item);
+void freeGiSTBuildBuffers(GISTBuildBuffers *gfbb);
+extern void relocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate, Relation r, GISTLoadedPartItem *path, Buffer buffer, SplitedPageLayout *dist);
+int getNodeBufferBusySize(NodeBuffer *nodeBuffer);
+
 #endif   /* GIST_PRIVATE_H */
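
For readers skimming the patch: the node buffer pages in gistbuildbuffers.c behave like a stack, with items packed downward from the end of the page and popped in reverse order. Below is a minimal standalone sketch of that idea, not the patch's code. All names (ToyPage, toy_page_push, toy_page_pop and so on) are hypothetical, and the sketch makes two assumptions of its own: each item carries an explicit length prefix instead of an IndexTuple header, and the header size is added when the free-space counter is turned into a pointer, so a completely full page cannot overlap the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192
/* Round a length up to a multiple of 8, in the spirit of MAXALIGN. */
#define TOY_ALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical stand-in for NodeBufferPage: small header, then item space. */
typedef struct ToyPage
{
	uint32_t	prev;		/* previous block of this buffer; unused here */
	uint32_t	freespace;	/* bytes still free between header and items */
	char		data[TOY_BLCKSZ - 2 * sizeof(uint32_t)];
} ToyPage;

#define TOY_HEADER offsetof(ToyPage, data)

static void
toy_page_init(ToyPage *page)
{
	page->prev = UINT32_MAX;
	page->freespace = (uint32_t) (TOY_BLCKSZ - TOY_HEADER);
}

static int
toy_page_is_empty(const ToyPage *page)
{
	return page->freespace == TOY_BLCKSZ - TOY_HEADER;
}

/* Push one item; returns 0 if it does not fit (a caller would spill the page). */
static int
toy_page_push(ToyPage *page, const void *item, uint32_t len)
{
	size_t		need = TOY_ALIGN(sizeof(uint32_t) + len);
	char	   *ptr;

	if (page->freespace < need)
		return 0;
	/* Items grow downward from the end of the page. */
	ptr = (char *) page + TOY_HEADER + page->freespace - need;
	memcpy(ptr, &len, sizeof(uint32_t));
	memcpy(ptr + sizeof(uint32_t), item, len);
	page->freespace -= (uint32_t) need;
	return 1;
}

/* Pop the most recently pushed item into out; returns its length. */
static uint32_t
toy_page_pop(ToyPage *page, void *out)
{
	char	   *ptr = (char *) page + TOY_HEADER + page->freespace;
	uint32_t	len;

	assert(!toy_page_is_empty(page));
	memcpy(&len, ptr, sizeof(uint32_t));
	memcpy(out, ptr + sizeof(uint32_t), len);
	page->freespace += (uint32_t) TOY_ALIGN(sizeof(uint32_t) + len);
	return len;
}
```

Pushing "abc" and then "hello" and popping twice returns "hello" first and "abc" second, after which the page reports empty again. The last-in, first-out order is why only the newest page of a node buffer has to stay in memory.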
#57Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
Re: WIP: Fast GiST index build

On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

That was a quite off-the-cuff remark, so I took the patch and culled out
loaded-tree bookkeeping. A lot of other changes fell off from that, so it
took me quite some time to get it working again, but here it is. This is a
*lot* smaller patch, although that's partly explained by the fact that I
left out some features: prefetching and the neighbor relocation code is
gone.

I'm pretty exhausted by this, so I just wanted to send this out without
further analysis. Let me know if you have questions on the approach taken.
I'm also not sure how this performs compared to your latest patch, I haven't
done any performance testing. Feel free to use this as is, or as a source of
inspiration :-).

I was also going to try to avoid keeping the loaded-tree hash. This might help
me a lot. Thanks.

------
With best regards,
Alexander Korotkov.

#58Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
Re: WIP: Fast GiST index build

I found a WAL problem with this patch. My patch uses a simplified insert
algorithm that doesn't insert downlinks one by one but inserts them all at
once. As a result, FollowRight flags are left uncleared when redoing from
WAL, because one WAL record can clear only one flag. Do you think changing
the WAL record structure is possible, or do I have to insert downlinks one
by one in the buffering build too?
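
To make the problem concrete, here is a toy model of the redo side. It is only a sketch under stated assumptions: a split into N pages leaves the flag set on N-1 of them, and an update record can name a single left child whose flag is cleared during replay. ToyIndex, ToyUpdateRecord and the toy_* functions are hypothetical names, not the real GiST WAL structures.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NPAGES 8
#define TOY_NO_LEFTCHILD (-1)

/* Hypothetical index state: one follow-right flag per page. */
typedef struct ToyIndex
{
	bool		follow_right[TOY_NPAGES];
} ToyIndex;

/* Hypothetical parent-update record with room for exactly one left child. */
typedef struct ToyUpdateRecord
{
	int			leftchild;
} ToyUpdateRecord;

/* A split sets the flag on every resulting page except the rightmost. */
static void
toy_split(ToyIndex *idx, const int *pages, int npages)
{
	for (int i = 0; i < npages - 1; i++)
		idx->follow_right[pages[i]] = true;
}

/* Redo of one record clears at most one flag. */
static void
toy_redo_update(ToyIndex *idx, const ToyUpdateRecord *rec)
{
	if (rec->leftchild == TOY_NO_LEFTCHILD)
		return;
	assert(rec->leftchild >= 0 && rec->leftchild < TOY_NPAGES);
	idx->follow_right[rec->leftchild] = false;
}

/* Downlinks inserted one by one: one record per left child. */
static void
toy_replay_one_by_one(ToyIndex *idx, const int *pages, int npages)
{
	for (int i = 0; i < npages - 1; i++)
	{
		ToyUpdateRecord rec = {pages[i]};

		toy_redo_update(idx, &rec);
	}
}

/* Downlinks inserted at once: a single record covers the whole split. */
static void
toy_replay_at_once(ToyIndex *idx, const int *pages, int npages)
{
	ToyUpdateRecord rec = {npages > 1 ? pages[0] : TOY_NO_LEFTCHILD};

	toy_redo_update(idx, &rec);
}

static int
toy_count_follow_right(const ToyIndex *idx)
{
	int			count = 0;

	for (int i = 0; i < TOY_NPAGES; i++)
		count += idx->follow_right[i] ? 1 : 0;
	return count;
}
```

In this model, a three-way split replayed with the single record leaves one flag set, while the one-by-one replay clears both. Letting the record carry a list of left children would be the record-structure change asked about above.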

------
With best regards,
Alexander Korotkov.

#59Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#58)
Re: WIP: Fast GiST index build

On 27.07.2011 15:29, Alexander Korotkov wrote:

I found a WAL problem with this patch. My patch uses a simplified insert
algorithm that doesn't insert downlinks one by one but inserts them all at
once. As a result, FollowRight flags are left uncleared when redoing from
WAL, because one WAL record can clear only one flag. Do you think changing
the WAL record structure is possible, or do I have to insert downlinks one
by one in the buffering build too?

Dunno, both approaches seem reasonable to me. There's no rule against
changing WAL record structure across major releases, if that's what you
were worried about.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#60Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#59)
Re: WIP: Fast GiST index build

On Wed, Jul 27, 2011 at 6:05 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Dunno, both approaches seem reasonable to me. There's no rule against
changing WAL record structure across major releases, if that's what you were
worried about.

OK, thanks. I also found GiST behaviour (without the patch) under streaming
replication that seems strange to me. On the master only a few rightlinks are
InvalidBlockNumber, while on the slave a lot of them are. I hacked gevel to
get the index structure on the slave (accessing the tree without
AccessExclusiveLock).

On master:
# create table test as (select point(random(),random()) from
generate_series(1,100000));
# create index test_idx on test using gist(point);
# \copy (select gist_tree('test_idx')) to 'tree1r.txt';

On slave:
# \copy (select gist_tree('test_idx')) to 'tree2r.txt';

In bash:
# cat tree1r.txt | sed 's/\\n/\n/g' > tree1.txt
# cat tree2r.txt | sed 's/\\n/\n/g' > tree2.txt
# diff tree1.txt tree2.txt

2,89c2,89
< 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
< 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319 (OK)
< 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153 (OK)
< 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551 (OK)
< 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555 (OK)
< 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74 (OK)
< 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294 (OK)
< 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567 (OK)
.....
---
> 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295 (InvalidBlockNumber)
> 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:4294967295 (InvalidBlockNumber)
> 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:4294967295 (InvalidBlockNumber)
> 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:4294967295 (InvalidBlockNumber)
> 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:4294967295 (InvalidBlockNumber)
> 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:4294967295 (InvalidBlockNumber)
.....

Isn't it a bug?

------
With best regards,
Alexander Korotkov.

#61Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#60)
Re: WIP: Fast GiST index build

On 27.07.2011 17:43, Alexander Korotkov wrote:

> 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295 (InvalidBlockNumber)
> 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:4294967295 (InvalidBlockNumber)
> 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:4294967295 (InvalidBlockNumber)
> 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:4294967295 (InvalidBlockNumber)
> 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:4294967295 (InvalidBlockNumber)
> 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:4294967295 (InvalidBlockNumber)
.....

Isn't it a bug?

Yeah, looks like a bug. I must've messed up the WAL logging in my recent
changes to this. I'll look into that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#62Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
Re: WIP: Fast GiST index build

On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Let me know if you have questions on the approach taken.

I realized that the approach which replaces keeping loaded subtrees
is unclear to me. We save paths between node buffers, but those paths can
become invalid on page splits. It seems to me that approximately the same volume
of code for maintaining parent links would have to be added to this version of
the patch to get it working correctly.

------
With best regards,
Alexander Korotkov.

#63Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#62)
Re: WIP: Fast GiST index build

On 28.07.2011 23:57, Alexander Korotkov wrote:

On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Let me know if you have questions on the approach taken.

I realized that the approach which replaces keeping loaded subtrees
is unclear to me. We save paths between node buffers, but those paths can
become invalid on page splits. It seems to me that approximately the same volume
of code for maintaining parent links would have to be added to this version of
the patch to get it working correctly.

gistFindCorrectParent() should take care of that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#63)
Re: WIP: Fast GiST index build

On Fri, Jul 29, 2011 at 1:10 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

gistFindCorrectParent() should take care of that.

OK, now it seems that I understand. I need to verify the amount of memory needed
for paths, because they seem to accumulate. I also need to
verify the final emptying, because the I/O guarantees of the original paper are based on
strictly descending final emptying.

------
With best regards,
Alexander Korotkov.

#65Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#60)
1 attachment(s)
Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 27.07.2011 17:43, Alexander Korotkov wrote:

OK, thanks. I also found behaviour of GiST (without the patch) under streaming
replication that seems strange to me. On the master only a few rightlinks
are InvalidBlockNumber, while on the slave there are a lot of them. I
hacked gevel to get the index structure on the slave (accessing the tree without
AccessExclusiveLock).

On master:
# create table test as (select point(random(),random()) from
generate_series(1,100000));
# create index test_idx on test using gist(point);
# \copy (select gist_tree('test_idx')) to 'tree1r.txt';

On slave:
# \copy (select gist_tree('test_idx')) to 'tree2r.txt';

In bash:
# cat tree1r.txt | sed 's/\\n/\n/g'> tree1.txt
# cat tree2r.txt | sed 's/\\n/\n/g'> tree2.txt
# diff tree1.txt tree2.txt

2,89c2,89
< 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
< 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319 (OK)
< 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153 (OK)
< 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551 (OK)
< 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555 (OK)
< 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74 (OK)
< 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294 (OK)
< 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567 (OK)
.....
---
> 1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295 (InvalidBlockNumber)
> 1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:4294967295 (InvalidBlockNumber)
> 2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:4294967295 (InvalidBlockNumber)
> 4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:4294967295 (InvalidBlockNumber)
> 5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
> 6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:4294967295 (InvalidBlockNumber)
> 7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:4294967295 (InvalidBlockNumber)
.....

Isn't it a bug?

Yeah, it sure looks like a bug. I was certain that I had broken this in
the recent changes to GiST handling of page splits, but in fact it has
been like that forever.

The rightlinks are not needed after crash recovery, because all the
downlinks should be there. A scan will find all pages through the
downlinks, and doesn't need to follow any rightlinks. I'm not sure why
we explicitly clear them, it's not like the rightlinks would do any harm
either, but for crash recovery that's harmless.

But a scan during hot standby can see those intermediate states, just
like concurrent scans can on the master. The locking on replay of page
split needs to be fixed, too. At the moment, it locks and writes out
each page separately, so a concurrent scan could "overtake" the WAL
replay while following rightlinks, and miss tuples on the right half.

Attached is a patch for that for 9.1/master. The 9.0 GiST replay code
was quite different, it will require a separate patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fix-gist-hot-standby-1.patch (text/x-diff)
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..60fc173 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -151,7 +151,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)->rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -222,16 +221,28 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	bool		isrootsplit = false;
+	Buffer	   *buffers;
 
+	/*
+	 * If this split inserted a downlink for a child at lower level, we can
+	 * now set the NSN and clear the follow-right flag on that child. It's
+	 * OK to do this before locking the parent page. If a concurrent scan
+	 * reads this parent page after we've already cleared the follow-right
+	 * flag on the child, it'll still follow the rightlink because of the
+	 * NSN.
+	 */
 	if (BlockNumberIsValid(xldata->leftchild))
 		gistRedoClearFollowRight(xldata->node, lsn, xldata->leftchild);
 	decodePageSplitRecord(&xlrec, record);
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data->npage * sizeof(Buffer));
 	for (i = 0; i < xlrec.data->npage; i++)
 	{
 		NewPage    *newpage = xlrec.page + i;
-		int			flags;
 
 		if (newpage->header->blkno == GIST_ROOT_BLKNO)
 		{
@@ -239,8 +250,19 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			isrootsplit = true;
 		}
 
-		buffer = XLogReadBuffer(xlrec.data->node, newpage->header->blkno, true);
-		Assert(BufferIsValid(buffer));
+		buffers[i] = XLogReadBuffer(xlrec.data->node,
+									newpage->header->blkno,
+									true);
+		Assert(BufferIsValid(buffers[i]));
+	}
+
+	/* Write out all the pages */
+	for (i = 0; i < xlrec.data->npage; i++)
+	{
+		NewPage    *newpage = xlrec.page + i;
+		int			flags;
+
+		buffer = buffers[i];
 		page = (Page) BufferGetPage(buffer);
 
 		/* ok, clear buffer */
@@ -277,6 +299,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 		MarkBufferDirty(buffer);
 		UnlockReleaseBuffer(buffer);
 	}
+
+	pfree(buffers);
 }
 
 static void

#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#65)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 27.07.2011 17:43, Alexander Korotkov wrote:

OK, thanks. I also found behaviour of GiST (without the patch) under streaming
replication that seems strange to me. On the master only a few rightlinks
are InvalidBlockNumber, while on the slave there are a lot of them. I
hacked gevel to get the index structure on the slave (accessing the tree without
AccessExclusiveLock).

On master:
# create table test as (select point(random(),random()) from
generate_series(1,100000));
# create index test_idx on test using gist(point);
# \copy (select gist_tree('test_idx')) to 'tree1r.txt';

On slave:
# \copy (select gist_tree('test_idx')) to 'tree2r.txt';

In bash:
# cat tree1r.txt | sed 's/\\n/\n/g'>  tree1.txt
# cat tree2r.txt | sed 's/\\n/\n/g'>  tree2.txt
# diff tree1.txt tree2.txt

2,89c2,89
<      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
<          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319 (OK)
<          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153 (OK)
<          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551 (OK)
<          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555 (OK)
<          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74 (OK)
<          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294 (OK)
<          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567 (OK)
.....
---
>      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295 (InvalidBlockNumber)
>          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:4294967295 (InvalidBlockNumber)
>          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
>          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:4294967295 (InvalidBlockNumber)
>          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:4294967295 (InvalidBlockNumber)
>          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
>          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:4294967295 (InvalidBlockNumber)
>          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:4294967295 (InvalidBlockNumber)
.....

Isn't it a bug?

Yeah, it sure looks like a bug. I was certain that I had broken this in the
recent changes to GiST handling of page splits, but in fact it has been like
that forever.

The rightlinks are not needed after crash recovery, because all the
downlinks should be there. A scan will find all pages through the downlinks,
and doesn't need to follow any rightlinks. I'm not sure why we explicitly
clear them, it's not like the rightlinks would do any harm either, but for
crash recovery that's harmless.

But a scan during hot standby can see those intermediate states, just like
concurrent scans can on the master. The locking on replay of page split
needs to be fixed, too. At the moment, it locks and writes out each page
separately, so a concurrent scan could "overtake" the WAL replay while
following rightlinks, and miss tuples on the right half.

Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
quite different, it will require a separate patch.

Hmm, I was assured no changes would be required for Hot Standby for
GIST and GIN. Perhaps we should check GIN code also.

Does the order of locking of the buffers matter? I'm sure it does.

Did you want me to write the patch for 9.0?

And what does NSN stand for? :-)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#67Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#66)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 01.08.2011 13:13, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
quite different, it will require a separate patch.

Hmm, I was assured no changes would be required for Hot Standby for
GIST and GIN. Perhaps we should check GIN code also.

Yeah, we probably should.

Does the order of locking of the buffers matter? I'm sure it does.

Yep.

Did you want me to write the patch for 9.0?

I'm looking at it now.

And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-). But in
case you want an explanation of what it does:

The NSN is a field in the GiST page header, used to detect concurrent page
splits. Whenever a page is split, its NSN is set to the LSN of the page
split record. To be precise: the NSN of the resulting left page(s) is
set, the resulting rightmost half keeps the NSN of the original page.
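
As a rough sketch of that rule (not the actual GiST code: `ToyPage`, `toy_stamp_split` and the field layout are made up for illustration, and the way the rightlinks are chained between the halves is an assumption), stamping the pages produced by a split could look like this:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ToyLsn;            /* hypothetical stand-in for a WAL position */

typedef struct
{
    uint32_t blkno;                 /* block number of this half */
    ToyLsn   nsn;                   /* NSN stored in the page header */
    uint32_t rightlink;             /* block number of the right sibling */
} ToyPage;

/*
 * halves[0..nhalves-1] are the pages produced by a split, left to right.
 * orig_nsn and orig_rightlink are the values the page had before the split.
 */
static void
toy_stamp_split(ToyPage *halves, int nhalves, ToyLsn split_lsn,
                ToyLsn orig_nsn, uint32_t orig_rightlink)
{
    assert(nhalves >= 2);

    for (int i = 0; i < nhalves; i++)
    {
        if (i < nhalves - 1)
        {
            /* every left page: NSN = LSN of the split record, link to next half */
            halves[i].nsn = split_lsn;
            halves[i].rightlink = halves[i + 1].blkno;
        }
        else
        {
            /* the rightmost half keeps the pre-split NSN and rightlink */
            halves[i].nsn = orig_nsn;
            halves[i].rightlink = orig_rightlink;
        }
    }
}
```

The point of the asymmetry is that only pages which gained a new right sibling need to advertise the split; the rightmost half looks, from the right, exactly like the original page did.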

When you scan a parent page and decide to move down, it's possible that
the child page is split before you read it, but after you read the
parent page. So you didn't see the downlink for the right half when you
scanned the parent page. To reach the right half, you need to follow the
rightlink from the left page, but first you need to detect that the page
was split. To do that, when you scan the parent page you remember the
LSN on the parent. When you scan the child, you compare the parent's LSN
you saw with the NSN of the child. If the child's NSN > parent's LSN,
the page was split after you scanned the parent, so you need to follow
the rightlink.

The B-tree code has similar move-right logic, but it uses the "high" key
on each page to decide when it needs to move right. There's no high key
on GiST pages, so we rely on the NSN for that.

In 9.1, I added the F_FOLLOW_RIGHT flag to handle the case of an
incomplete split correctly. If the flag is set on a page, its rightlink
needs to be followed regardless of the NSN.
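
Put together, the decision a scan makes when it arrives at a child page can be sketched roughly as below. This is a simplified illustration, not the real scan code: `ToyChildHeader` and `toy_must_follow_rightlink` are hypothetical names, and the LSN is modelled as a plain 64-bit integer. The invalid-block value is the 4294967295 that gevel prints as InvalidBlockNumber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK 0xFFFFFFFFu   /* 4294967295 */

typedef uint64_t ToyLsn;                /* hypothetical stand-in for a WAL position */

typedef struct
{
    ToyLsn   nsn;                       /* set when the page was split */
    bool     follow_right;              /* models the F_FOLLOW_RIGHT flag */
    uint32_t rightlink;                 /* right sibling, or TOY_INVALID_BLOCK */
} ToyChildHeader;

/*
 * parent_lsn_seen is the parent's LSN remembered when the downlink was read.
 * Returns true if the scan must also visit the child's right sibling.
 */
static bool
toy_must_follow_rightlink(ToyLsn parent_lsn_seen, const ToyChildHeader *child)
{
    assert(child != NULL);

    if (child->rightlink == TOY_INVALID_BLOCK)
        return false;                   /* nothing to follow */
    if (child->follow_right)
        return true;                    /* incomplete split: always move right */
    /* child was split after we looked at the parent */
    return child->nsn > parent_lsn_seen;
}
```

This also shows why clearing the rightlink during replay matters on a hot standby: with the rightlink invalid, neither the NSN nor the flag can send the scan to the right half.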

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#68Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#67)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 1 August 2011 11:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-).

Node Sequence Number.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Thom Brown (#68)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:

On 1 August 2011 11:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-).

Node Sequence Number.

Do you have a reference to support that explanation?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#70Thom Brown
thom@linux.com
In reply to: Simon Riggs (#69)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 1 August 2011 12:25, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:

On 1 August 2011 11:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-).

Node Sequence Number.

Do you have a reference to support that explanation?

Here's one reference to it:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg00294.php

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#71Alexander Korotkov
aekorotkov@gmail.com
In reply to: Simon Riggs (#69)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 3:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:

On 1 August 2011 11:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-).

Node Sequence Number.

Do you have a reference to support that explanation?

See "Access Methods for Next-Generation Database Systems" by Marcel
Kornacker, Chapter 4 "Concurrency and Recovery for GiSTs".
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz

------
With best regards,
Alexander Korotkov.

#72Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#67)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Does the order of locking of the buffers matter? I'm sure it does.

Yep.

Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#73Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#72)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 01.08.2011 14:35, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Does the order of locking of the buffers matter? I'm sure it does.

Yep.

Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?

I just meant that yes, the order of locking of the buffers does matter.

I believe the code acquires the locks in the right order already, and the
patch I posted fixes the premature release of locks at page split.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#74Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#73)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 14:35, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

Does the order of locking of the buffers matter? I'm sure it does.

Yep.

Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?

I just meant that yes, the order of locking of the buffers does matter.

I believe the code acquires the locks in the right order already, and the patch I
posted fixes the premature release of locks at page split.

Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#75Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#74)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 01.08.2011 17:26, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I believe the code acquires the locks in the right order already, and the patch I
posted fixes the premature release of locks at page split.

Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.

All but the first page are unused pages, grabbed with either P_NEW or
from the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard for
the case that someone else chooses the same victim buffer, and picks
another page.
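
A toy model of that "try the lock, otherwise pick another page" pattern is sketched below. It is single-threaded and purely illustrative: `ToyBuffer`, `toy_conditional_lock` and `toy_new_buffer` are made-up names, and the real ConditionalLockBuffer() acquires the lock atomically rather than testing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    unsigned blkno;     /* block number of a candidate unused page */
    bool     locked;    /* true if somebody already holds the lock */
} ToyBuffer;

/* Take the lock only if it is free; never wait. */
static bool
toy_conditional_lock(ToyBuffer *buf)
{
    assert(buf != NULL);

    if (buf->locked)
        return false;
    buf->locked = true;
    return true;
}

/*
 * Return the first candidate page whose lock could be taken without
 * waiting, or NULL if all are busy (a caller would then extend the
 * relation instead).
 */
static ToyBuffer *
toy_new_buffer(ToyBuffer *candidates, size_t ncandidates)
{
    for (size_t i = 0; i < ncandidates; i++)
    {
        if (toy_conditional_lock(&candidates[i]))
            return &candidates[i];
    }
    return NULL;
}
```

Because the allocator never blocks on a candidate page, it cannot take part in a lock-ordering deadlock, which is why no particular ordering is needed for the new pages of a split.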

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#75)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Mon, Aug 1, 2011 at 3:34 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 17:26, Simon Riggs wrote:

On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

I believe the code acquires the locks in the right order already, and the patch I
posted fixes the premature release of locks at page split.

Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.

All but the first page are unused pages, grabbed with either P_NEW or from
the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard for the case
that someone else chooses the same victim buffer, and picks another page.

Seems good. Thanks for checking some more for me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#77Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#67)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 01.08.2011 13:44, Heikki Linnakangas wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

Did you want me to write the patch for 9.0?

I'm looking at it now.

So, in 9.0, we currently leave the rightlink and NSN invalid when
replaying a page split. To set them correctly, we'd need the old
rightlink and NSN from the page being split, but the WAL record doesn't
currently include them. I can see two solutions to this:

1. Add them to the page split WAL record. That's straightforward, but
breaks WAL format compatibility with older minor versions.

2. Read the old page version, and copy the rightlink and NSN from there.
Since we're restoring what's basically a full-page image of the page
after the split, in crash recovery the old contents might be a torn
page, or a newer version of the page. I believe that's harmless, because
we only care about the NSN and rightlink in hot standby mode, but it's a
bit ugly.

If we change the WAL record, we have to make it so that the new version
can still read the old format, which complicates the implementation a
bit. Nevertheless, I'm leaning towards option 1.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#78Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#77)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 01.08.2011 13:44, Heikki Linnakangas wrote:

On 01.08.2011 13:13, Simon Riggs wrote:

Did you want me to write the patch for 9.0?

I'm looking at it now.

So, in 9.0, we currently leave the rightlink and NSN invalid when replaying
a page split. To set them correctly, we'd need the old rightlink and NSN
from the page being split, but the WAL record doesn't currently include
them. I can see two solutions to this:

1. Add them to the page split WAL record. That's straightforward, but breaks
WAL format compatibility with older minor versions.

2. Read the old page version, and copy the rightlink and NSN from there.
Since we're restoring what's basically a full-page image of the page after
the split, in crash recovery the old contents might be a torn page, or a
newer version of the page. I believe that's harmless, because we only care
about the NSN and rightlink in hot standby mode, but it's a bit ugly.

If we change the WAL record, we have to make it so that the new version can
still read the old format, which complicates the implementation a bit.
Nevertheless, I'm leaning towards option 1.

We may as well do (1), with two versions of the WAL record.

Hmm, the biggest issue is actually that existing GIST indexes are
corrupted, from the perspective of being unusable during HS.

We can fix the cause but that won't repair the existing damage. So the
requirement is for us to re/create new indexes, which can then use a
new WAL record format. We probably need to store some information in
the metapage saying whether or not the index has been maintained only
with v2 WAL records, or with a mixture of v1 and v2 records. If the
latter, then issue a WARNING to rebuild the index.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#79Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#78)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 02.08.2011 14:36, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

If we change the WAL record, we have to make it so that the new version can
still read the old format, which complicates the implementation a bit.
Nevertheless, I'm leaning towards option 1.

We may as well do (1), with two versions of the WAL record.

Actually I think we can append the new information to the end of the
page split record, so that an old version server can read WAL generated
by new version, too. It just won't set the right link and NSN correctly,
so hot standby will be broken like it is today.

Hmm, the biggest issue is actually that existing GIST indexes are
corrupted, from the perspective of being unusable during HS.

We can fix the cause but that won't repair the existing damage. So the
requirement is for us to re/create new indexes, which can then use a
new WAL record format.

No-no, it's not that bad. The right-links and NSNs are only needed to
handle scans concurrent with page splits. The existing indexes are fine,
you only have a problem if you run queries in hot standby mode, while
you replay page splits on it. As soon as you upgrade the master and
standby to new minor version with the fix, that will work too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#80Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#79)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 02.08.2011 14:36, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

If we change the WAL record, we have to make it so that the new version can
still read the old format, which complicates the implementation a bit.
Nevertheless, I'm leaning towards option 1.

We may as well do (1), with two versions of the WAL record.

Actually I think we can append the new information to the end of the page
split record, so that an old version server can read WAL generated by new
version, too.

Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#81Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#64)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

I'm now working on adding features to your version of the patch. The current version
is attached. Somehow this version produces a huge amount of WAL, and that makes
it slow, even though the count and average length of WAL records are similar to those of
the non-buffering build.

test=# create table points as (select point(random(),random()) from
generate_series(1,1000000));
SELECT 1000000
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
pg_xlogfile_name_offset
-------------------------------------
(000000010000004000000073,15005048)
(1 row)

test=# create index points_idx on points using gist (point) with
(buffering=off);
CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
pg_xlogfile_name_offset
-------------------------------------
(00000001000000400000007E,13764024)
(1 row)

test=# create index points_idx2 on points using gist (point) with
(buffering=on, neighborrelocation=off);
INFO: Level step = 1, pagesPerBuffer = 406
NOTICE: final emptying
NOTICE: final emptying
NOTICE: final emptying
NOTICE: final emptying
CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
pg_xlogfile_name_offset
-------------------------------------
(0000000100000040000000D2,10982288)
(1 row)
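
For a sense of scale, the three positions above can be turned into byte counts. The sketch below assumes the default 16 MB WAL segment size and that the positions being compared share the same timeline and log id, as they do here; `WalPos`, `parse_wal_pos` and `wal_bytes_between` are illustrative names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)   /* assumed default segment size */

typedef struct
{
    unsigned tli;       /* timeline id: hex digits 1-8 of the file name */
    unsigned log;       /* log id: hex digits 9-16 */
    unsigned seg;       /* segment number: hex digits 17-24 */
    uint64_t offset;    /* byte offset within the segment */
} WalPos;

/* Parse the (file name, offset) pair printed by pg_xlogfile_name_offset(). */
static int
parse_wal_pos(const char *fname, uint64_t offset, WalPos *pos)
{
    if (sscanf(fname, "%8X%8X%8X", &pos->tli, &pos->log, &pos->seg) != 3)
        return -1;
    pos->offset = offset;
    return 0;
}

/* Bytes of WAL written between two positions within the same log id. */
static int64_t
wal_bytes_between(const WalPos *from, const WalPos *to)
{
    assert(from->tli == to->tli && from->log == to->log);

    return ((int64_t) to->seg - (int64_t) from->seg) * WAL_SEG_SIZE
        + ((int64_t) to->offset - (int64_t) from->offset);
}
```

Under those assumptions the non-buffering build wrote 183308352 bytes of WAL (11 segments minus 1241024 bytes), while the buffering build wrote 1406504408 bytes (84 segments minus 2781736 bytes), roughly 7.7 times as much.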

Maybe you have some ideas about it?

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.9.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.9.0.patch.gzDownload
#82Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#80)
1 attachment(s)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 02.08.2011 15:18, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 02.08.2011 14:36, Simon Riggs wrote:
Actually I think we can append the new information to the end of the page
split record, so that an old version server can read WAL generated by new
version, too.

Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.

Here's a patch to do what I meant. The new fields are stored at the very
end of the WAL record, and you check the length to see if they're there
or not. The nice thing about this is that it's compatible in both
directions.
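
The length check described above can be sketched as follows. This is only an
illustration of the idea, not code from the patch: SplitTrailer,
decode_trailer() and the old_len parameter (the number of bytes the old
record format occupies) are hypothetical names, and UINT32_MAX stands in for
InvalidBlockNumber.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the two fields appended to the record. */
typedef struct
{
    uint32_t    origrlink;      /* original rightlink of the split page */
    uint64_t    orignsn;        /* original NSN of the split page */
    int         has_trailer;    /* 1 if the record carried the new fields */
} SplitTrailer;

#define TRAILER_LEN (sizeof(uint32_t) + sizeof(uint64_t))

/*
 * rec points at the record payload, rec_len is its total length, and
 * old_len is how many bytes the old-format payload occupies.
 */
static SplitTrailer
decode_trailer(const char *rec, size_t rec_len, size_t old_len)
{
    SplitTrailer t;

    if (rec_len >= old_len + TRAILER_LEN)
    {
        /* New format: the extra fields sit right after the old payload. */
        memcpy(&t.origrlink, rec + old_len, sizeof(uint32_t));
        memcpy(&t.orignsn, rec + old_len + sizeof(uint32_t), sizeof(uint64_t));
        t.has_trailer = 1;
    }
    else
    {
        /* Old format: nothing was appended, so fall back to defaults. */
        t.origrlink = UINT32_MAX;   /* stand-in for InvalidBlockNumber */
        t.orignsn = 0;
        t.has_trailer = 0;
    }
    return t;
}
```

A reader that only knows the old format stops after old_len bytes and never
looks at the trailer, which is what makes the scheme work in both directions.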

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-split-hotstandby-90.patchtext/x-diff; name=gist-split-hotstandby-90.patchDownload
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..60fc173 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -151,7 +151,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)->rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -222,16 +221,28 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	bool		isrootsplit = false;
+	Buffer	   *buffers;
 
+	/*
+	 * If this split inserted a downlink for a child at lower level, we can
+	 * now set the NSN and clear the follow-right flag on that child. It's
+	 * OK to do this before locking the parent page. If a concurrent scan
+	 * reads this parent page after we've already cleared the follow-right
+	 * flag on the child, it'll still follow the rightlink because of the
+	 * NSN.
+	 */
 	if (BlockNumberIsValid(xldata->leftchild))
 		gistRedoClearFollowRight(xldata->node, lsn, xldata->leftchild);
 	decodePageSplitRecord(&xlrec, record);
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data->npage * sizeof(Buffer));
 	for (i = 0; i < xlrec.data->npage; i++)
 	{
 		NewPage    *newpage = xlrec.page + i;
-		int			flags;
 
 		if (newpage->header->blkno == GIST_ROOT_BLKNO)
 		{
@@ -239,8 +250,19 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			isrootsplit = true;
 		}
 
-		buffer = XLogReadBuffer(xlrec.data->node, newpage->header->blkno, true);
-		Assert(BufferIsValid(buffer));
+		buffers[i] = XLogReadBuffer(xlrec.data->node,
+									newpage->header->blkno,
+									true);
+		Assert(BufferIsValid(buffers[i]));
+	}
+
+	/* Write out all the pages */
+	for (i = 0; i < xlrec.data->npage; i++)
+	{
+		NewPage    *newpage = xlrec.page + i;
+		int			flags;
+
+		buffer = buffers[i];
 		page = (Page) BufferGetPage(buffer);
 
 		/* ok, clear buffer */
@@ -277,6 +299,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 		MarkBufferDirty(buffer);
 		UnlockReleaseBuffer(buffer);
 	}
+
+	pfree(buffers);
 }
 
 static void
#83Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
1 attachment(s)
Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

On 02.08.2011 20:06, Alvaro Herrera wrote:

Excerpts from Heikki Linnakangas's message of mar ago 02 11:59:24 -0400 2011:

On 02.08.2011 15:18, Simon Riggs wrote:

On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 02.08.2011 14:36, Simon Riggs wrote:
Actually I think we can append the new information to the end of the page
split record, so that an old version server can read WAL generated by new
version, too.

Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, just resolve which is which by the record length. Don't
like that.

Here's a patch to do what I meant. The new fields are stored at the very
end of the WAL record, and you check the length to see if they're there
or not. The nice thing about this is that it's compatible in both
directions.

Err, did you attach the wrong patch?

Yes, sorry about that. Here's the right patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-split-hotstandby-90.patchtext/x-diff; name=gist-split-hotstandby-90.patchDownload
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 82ba726..71c145d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -377,9 +377,18 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			state->ituplen++;
 		}
 
-		/* saves old rightlink */
+		/* save old rightlink and NSN */
 		if (state->stack->blkno != GIST_ROOT_BLKNO)
+		{
 			rrlink = GistPageGetOpaque(dist->page)->rightlink;
+			oldnsn = GistPageGetOpaque(dist->page)->nsn;
+		}
+		else
+		{
+			/* if root split we should put initial value */
+			rrlink = InvalidBlockNumber;
+			oldnsn = PageGetLSN(dist->page);
+		}
 
 		START_CRIT_SECTION();
 
@@ -407,7 +416,8 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			XLogRecData *rdata;
 
 			rdata = formSplitRdata(state->r->rd_node, state->stack->blkno,
-								   is_leaf, &(state->key), dist);
+								   is_leaf, &(state->key), dist,
+								   rrlink, &oldnsn);
 
 			recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_SPLIT, rdata);
 
@@ -425,12 +435,6 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate)
 			}
 		}
 
-		/* set up NSN */
-		oldnsn = GistPageGetOpaque(dist->page)->nsn;
-		if (state->stack->blkno == GIST_ROOT_BLKNO)
-			/* if root split we should put initial value */
-			oldnsn = PageGetLSN(dist->page);
-
 		for (ptr = dist; ptr; ptr = ptr->next)
 		{
 			/* only for last set oldnsn */
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 7f5dd99..cdd8aaf 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -39,6 +39,8 @@ typedef struct
 {
 	gistxlogPageSplit *data;
 	NewPage    *page;
+	BlockNumber origrlink;
+	XLogRecPtr	orignsn;
 } PageSplitRecord;
 
 /* track for incomplete inserts, idea was taken from nbtxlog.c */
@@ -250,7 +252,6 @@ gistRedoPageUpdateRecord(XLogRecPtr lsn, XLogRecord *record, bool isnewroot)
 		 */
 		GistPageSetLeaf(page);
 
-	GistPageGetOpaque(page)->rightlink = InvalidBlockNumber;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
@@ -310,6 +311,26 @@ decodePageSplitRecord(PageSplitRecord *decoded, XLogRecord *record)
 			j++;
 		}
 	}
+
+	/*
+	 * Starting with 9.0.5, the original NSN and rightlink on the split page
+	 * are stored here. It would've been more logical to add them to the
+	 * gistxlogPageSplit struct, but that would've broken compatibility with
+	 * the pre-9.0.5 WAL format.
+	 */
+	if (ptr - begin < record->xl_len)
+	{
+		memcpy(&decoded->origrlink, ptr, sizeof(BlockNumber));
+		ptr += sizeof(BlockNumber);
+		memcpy(&decoded->orignsn, ptr, sizeof(XLogRecPtr));
+	}
+	else
+	{
+		/* pre-9.0.5 format, no rightlink/NSN information */
+		decoded->origrlink = InvalidBlockNumber;
+		decoded->orignsn.xlogid = 0;
+		decoded->orignsn.xrecoff = 0;
+	}
 }
 
 static void
@@ -320,17 +341,32 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	Page		page;
 	int			i;
 	int			flags;
+	Buffer	   *buffers;
 
 	decodePageSplitRecord(&xlrec, record);
 	flags = xlrec.data->origleaf ? F_LEAF : 0;
 
-	/* loop around all pages */
+	/*
+	 * Lock all the pages involved in the split first, so that any concurrent
+	 * scans in hot standby mode will see the split as an atomic operation.
+	 */
+	buffers = palloc(xlrec.data->npage * sizeof(Buffer));
 	for (i = 0; i < xlrec.data->npage; i++)
 	{
 		NewPage    *newpage = xlrec.page + i;
 
-		buffer = XLogReadBuffer(xlrec.data->node, newpage->header->blkno, true);
-		Assert(BufferIsValid(buffer));
+		buffers[i] = XLogReadBuffer(xlrec.data->node,
+									newpage->header->blkno,
+									true);
+		page = (Page) BufferGetPage(buffers[i]);
+	}
+
+	/* Write out all the pages */
+	for (i = 0; i < xlrec.data->npage; i++)
+	{
+		NewPage    *newpage = xlrec.page + i;
+
+		buffer = buffers[i];
 		page = (Page) BufferGetPage(buffer);
 
 		/* ok, clear buffer */
@@ -339,6 +375,18 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 		/* and fill it */
 		gistfillbuffer(page, newpage->itup, newpage->header->num, FirstOffsetNumber);
 
+		/* Set NSN and rightlink, needed for concurrent scans in hot standby */
+		if (i == xlrec.data->npage - 1)
+		{
+			GistPageGetOpaque(page)->nsn = xlrec.orignsn;
+			GistPageGetOpaque(page)->rightlink = xlrec.origrlink;
+		}
+		else
+		{
+			GistPageGetOpaque(page)->nsn = lsn;
+			GistPageGetOpaque(page)->rightlink = xlrec.page[i + 1].header->blkno;
+		}
+
 		PageSetLSN(page, lsn);
 		PageSetTLI(page, ThisTimeLineID);
 		MarkBufferDirty(buffer);
@@ -350,6 +398,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 	pushIncompleteInsert(xlrec.data->node, lsn, xlrec.data->key,
 						 NULL, 0,
 						 &xlrec);
+
+	pfree(buffers);
 }
 
 static void
@@ -655,6 +705,8 @@ gistContinueInsert(gistIncompleteInsert *insert)
 			XLogRecPtr	recptr;
 			Buffer		tempbuffer = InvalidBuffer;
 			int			ntodelete = 0;
+			BlockNumber	rrlink;
+			XLogRecPtr	oldnsn;
 
 			numbuffer = 1;
 			buffers[0] = ReadBuffer(index, insert->path[i]);
@@ -691,6 +743,10 @@ gistContinueInsert(gistIncompleteInsert *insert)
 			if (ntodelete == 0)
 				elog(PANIC, "gistContinueInsert: cannot find pointer to page(s)");
 
+			/* Remember old rightlink and NSN */
+			rrlink = GistPageGetOpaque(pages[0])->rightlink;
+			oldnsn = GistPageGetOpaque(pages[0])->nsn;
+
 			/*
 			 * we check space with subtraction only first tuple to delete,
 			 * hope, that wiil be enough space....
@@ -742,7 +798,8 @@ gistContinueInsert(gistIncompleteInsert *insert)
 				xlinfo = XLOG_GIST_PAGE_SPLIT;
 				rdata = formSplitRdata(index->rd_node, insert->path[i],
 									   false, &(insert->key),
-									 gistMakePageLayout(buffers, numbuffer));
+									   gistMakePageLayout(buffers, numbuffer),
+									   rrlink, &oldnsn);
 
 			}
 			else
@@ -849,7 +906,8 @@ gist_safe_restartpoint(void)
 
 XLogRecData *
 formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
-			   ItemPointer key, SplitedPageLayout *dist)
+			   ItemPointer key, SplitedPageLayout *dist,
+			   BlockNumber origrlink, XLogRecPtr *orignsn)
 {
 	XLogRecData *rdata;
 	gistxlogPageSplit *xlrec = (gistxlogPageSplit *) palloc(sizeof(gistxlogPageSplit));
@@ -864,7 +922,7 @@ formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 		ptr = ptr->next;
 	}
 
-	rdata = (XLogRecData *) palloc(sizeof(XLogRecData) * (npage * 2 + 2));
+	rdata = (XLogRecData *) palloc(sizeof(XLogRecData) * (npage * 2 + 4));
 
 	xlrec->node = node;
 	xlrec->origblkno = blkno;
@@ -893,11 +951,24 @@ formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 		rdata[cur].data = (char *) (ptr->list);
 		rdata[cur].len = ptr->lenlist;
 		rdata[cur - 1].next = &(rdata[cur]);
-		rdata[cur].next = NULL;
 		cur++;
 		ptr = ptr->next;
 	}
 
+	/* Append origin rightlink and NSN */
+	rdata[cur].buffer = InvalidBuffer;
+	rdata[cur].data = (char *) &origrlink;
+	rdata[cur].len = sizeof(BlockNumber);
+	rdata[cur - 1].next = &(rdata[cur]);
+	cur++;
+
+	rdata[cur].buffer = InvalidBuffer;
+	rdata[cur].data = (char *) orignsn;
+	rdata[cur].len = sizeof(XLogRecPtr);
+	rdata[cur - 1].next = &(rdata[cur]);
+
+	rdata[cur].next = NULL;
+
 	return rdata;
 }
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 4df5fed..d4c8f04 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -260,7 +260,8 @@ extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
 
 extern XLogRecData *formSplitRdata(RelFileNode node,
 			   BlockNumber blkno, bool page_is_leaf,
-			   ItemPointer key, SplitedPageLayout *dist);
+			   ItemPointer key, SplitedPageLayout *dist,
+			   BlockNumber origrlink, XLogRecPtr *orignsn);
 
 extern XLogRecPtr gistxlogInsertCompletion(RelFileNode node, ItemPointerData *keys, int len);
 
#84Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#81)
1 attachment(s)
Re: WIP: Fast GiST index build

I found that in the previous version of the patch I missed PageSetLSN
and PageSetTLI, but the huge amount of WAL is still there. I also found that the
huge amount of WAL appears only with -O2. With -O0 the amount of WAL is OK, but
messages like "FATAL: xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750" appear. It seems there is some totally wrong
use of WAL if even the optimization level matters...

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.9.1.patch.gzapplication/x-gzip; name=gist_fast_build-0.9.1.patch.gzDownload
#85Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#84)
Re: WIP: Fast GiST index build

On Wed, Aug 3, 2011 at 4:18 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

I found that in previous version of patch I missed PageSetLSN
and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
messages "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750" appears. Seems that there is some totally wrong
use of WAL if even optimization level does matter...

Try setting wal_debug=true to see what records are getting emitted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#86Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#84)
Re: WIP: Fast GiST index build

On 03.08.2011 11:18, Alexander Korotkov wrote:

I found that in previous version of patch I missed PageSetLSN
and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
messages "FATAL: xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750" appears. Seems that there is some totally wrong
use of WAL if even optimization level does matter...

Try this:

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 5099330..5a441e0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
  		/* Write the WAL record */
  		if (RelationNeedsWAL(state->r))
  		{
-			gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
+			recptr = gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
  													itup, ntup,	InvalidBuffer);
  			PageSetLSN(page, recptr);
  			PageSetTLI(page, ThisTimeLineID);
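
The one-line fix above matters because, without the assignment, recptr is
read while still uninitialized, so the page gets stamped with whatever
happens to be in that variable. Reading an uninitialized local is undefined
behaviour in C, which would be consistent with the result changing between
-O0 and -O2 and with the implausible flush request seen in the FATAL message.
The sketch below shows the corrected pattern in isolation; Lsn, FakePage,
fake_xlog_update() and update_page_logged() are hypothetical stand-ins, not
PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

typedef struct
{
    Lsn         lsn;            /* stand-in for the page header LSN */
} FakePage;

static Lsn  next_lsn = 1000;    /* pretend end-of-WAL position */

/* Stand-in for gistXLogUpdate(): "writes" a record, returns its LSN. */
static Lsn
fake_xlog_update(void)
{
    next_lsn += 64;
    return next_lsn;
}

static void
update_page_logged(FakePage *page)
{
    Lsn         recptr;

    /* The return value must be captured before it is used below. */
    recptr = fake_xlog_update();
    page->lsn = recptr;         /* corresponds to PageSetLSN(page, recptr) */
}
```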

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#87Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#86)
Re: WIP: Fast GiST index build

Uhh, my bad, really stupid bug. Many thanks.

------
With best regards,
Alexander Korotkov.

On Wed, Aug 3, 2011 at 8:31 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 03.08.2011 11:18, Alexander Korotkov wrote:

I found that in previous version of patch I missed PageSetLSN
and PageSetTLI, but huge amount of WAL is still here. Also I found that
huge
amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
messages "FATAL: xlog flush request BFF11148/809A600 is not satisfied ---
flushed only to 44/9C518750" appears. Seems that there is some totally
wrong
use of WAL if even optimization level does matter...

Try this:

diff --git a/src/backend/access/gist/gistbuild.c
b/src/backend/access/gist/gistbuild.c
index 5099330..5a441e0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
/* Write the WAL record */
if (RelationNeedsWAL(state->r))
{
-                       gistXLogUpdate(state->r->rd_node, buffer,
oldoffnum, noldoffnum,
+                       recptr = gistXLogUpdate(state->r->rd_node,
buffer, oldoffnum, noldoffnum,

itup, ntup, InvalidBuffer);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#88Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#87)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

Here is the latest version of the patch. The most significant changes
compared with your version of the patch are:
1) Reference counting of path items was added. It should help against
possible accumulation of path items.
2) Neighbor relocation was added.
3) Subtree prefetching was added.
4) The final emptying algorithm was reverted to the original one. My experiments
show that the typical number of passes in final emptying in your version of the
patch is about 5, which may be significant in itself. Also, I have no estimate
of the number of passes and no guarantee that it will not be high in some
corner cases. That is, I prefer the more predictable single-pass algorithm even
though it is a little more complex.
5) Even the last page of a node buffer is now unloaded from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It seems to
me that keeping one page of each buffer in memory may consume too much memory.
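
To illustrate item 1, here is a minimal sketch of reference-counted path
items, where each item keeps its parent alive and releasing the last
reference frees the chain upwards. It is an assumption about the general
technique, not the code in the attached patch; PathItem, path_item_create(),
path_item_release() and live_items are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct PathItem
{
    unsigned    blkno;          /* block number of the index page */
    int         refcount;       /* number of holders of this item */
    struct PathItem *parent;    /* item for the parent page, or NULL */
} PathItem;

static int  live_items = 0;     /* counts allocated items, for the demo */

static PathItem *
path_item_create(unsigned blkno, PathItem *parent)
{
    PathItem   *item = malloc(sizeof(PathItem));

    assert(item != NULL);
    item->blkno = blkno;
    item->refcount = 1;         /* reference held by the creator */
    item->parent = parent;
    if (parent != NULL)
        parent->refcount++;     /* a child keeps its parent alive */
    live_items++;
    return item;
}

static void
path_item_release(PathItem *item)
{
    /* Drop one reference; free upwards while counts reach zero. */
    while (item != NULL && --item->refcount == 0)
    {
        PathItem   *parent = item->parent;

        free(item);
        live_items--;
        item = parent;
    }
}
```

With this scheme an item disappears as soon as neither a buffer nor a child
item refers to it, so unused path items cannot pile up.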

Open items I see at this moment:
1) I dislike my method of switching to buffering build because it's based on
very unproven assumptions, and so far I haven't found more reliable assumptions
in scientific papers. I would like to replace it with something much
simpler, for example switching to buffering build when the regular build
actually starts to produce a lot of IO. To implement this approach I
need to somehow detect actual IO (not just a buffer read but a miss of the OS
cache).
2) I'm worried about the possible size of nodeBuffersTab and path items. If we
imagine an extremely large tree with levelstep = 1, the size of these data
structures will be significant. And it's hard to predict that size without
knowing the tree size.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.10.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.10.0.patch.gzDownload
#89Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#88)
Re: WIP: Fast GiST index build

On 07.08.2011 22:28, Alexander Korotkov wrote:

Here is the latest version of the patch. The most significant changes
compared with your version of the patch are:
1) Reference counting of path items was added. It should help against
possible accumulation of path items.

Ok.

2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to
enable neighbor relocation. If I remember the performance tests
correctly, it improves the quality of the resulting index, but incurs a
significant CPU overhead.

Actually, looking at the performance numbers on the wiki page again
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it
looks like neighbor relocation doesn't help very much with the index
quality - sometimes it even results in a slightly worse index. Based on
those results, shouldn't we just remove it? Or is there some other data
set where it helps significantly?

3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have
some performance numbers that show that it's critical for the overall
performance. But I don't think that was the case, it's just an
additional optimization for servers with big RAID arrays. So, please
separate that into an add-on patch. It needs to be performance tested and
reviewed separately.

4) The final emptying algorithm was reverted to the original one. My
experiments show that the typical number of passes in final emptying in
your version of the patch is about 5, which may be significant in itself.
Also, I have no estimate of the number of passes and no guarantee that it will
not be high in some corner cases. That is, I prefer the more predictable
single-pass algorithm even though it is a little more complex.

I was trying to get rid of that complexity during index build. Some
extra code in the final pass would be easier to understand than extra
work that needs to be done through the index build. It's not a huge
amount of code, but still.

I'm not worried about the extra CPU overhead of scanning the data
structures at the final pass. I guess in my patch you had to do extra
I/O as well, because the buffers were not emptied in strict top-down
order, so let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add
the buffer to the right list at getNodeBuffer(). That way in the final
stage, you need to scan through all buffers instead of just the
non-empty ones. But the overhead of that should be minimal in
practice, scanning some in-memory data structures is pretty cheap
compared to building an index. That way you wouldn't need to maintain
the lists during the index build, except for adding each buffer to
correct lists in getNodeBuffer().
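
The proposal above can be sketched as follows: every buffer is linked into
the list for its level when it is created, and the final stage walks the
lists from the top level down. This is a simplified model, not the patch
code. NodeBuffer, BuildBuffers, get_node_buffer() and final_emptying() are
hypothetical names, and "emptying" just moves a tuple count to the first
buffer one level down, or counts it as reaching the leaves when there is no
buffer on that level.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVELS 8            /* assumption: enough levels for the sketch */

typedef struct NodeBuffer
{
    int         level;          /* tree level of the node owning the buffer */
    int         ntuples;        /* tuples currently waiting in the buffer */
    struct NodeBuffer *next;    /* next buffer on the same level */
} NodeBuffer;

typedef struct
{
    NodeBuffer *by_level[MAX_LEVELS];   /* one list per level, empty or not */
} BuildBuffers;

/* Create a buffer and link it into its level's list right away. */
static NodeBuffer *
get_node_buffer(BuildBuffers *bb, int level)
{
    NodeBuffer *nb;

    assert(level >= 0 && level < MAX_LEVELS);
    nb = calloc(1, sizeof(NodeBuffer));
    assert(nb != NULL);
    nb->level = level;
    nb->next = bb->by_level[level];
    bb->by_level[level] = nb;
    return nb;
}

/* Single top-down pass; returns how many tuples reached the leaves. */
static int
final_emptying(BuildBuffers *bb)
{
    int         reached_leaves = 0;
    int         level;

    for (level = MAX_LEVELS - 1; level >= 0; level--)
    {
        NodeBuffer *nb;

        for (nb = bb->by_level[level]; nb != NULL; nb = nb->next)
        {
            NodeBuffer *child = (level > 0) ? bb->by_level[level - 1] : NULL;

            if (nb->ntuples == 0)
                continue;       /* empty buffers cost only this check */
            if (child != NULL)
                child->ntuples += nb->ntuples;  /* push one level down */
            else
                reached_leaves += nb->ntuples;  /* no buffer below */
            nb->ntuples = 0;
        }
    }
    return reached_leaves;
}
```

Because upper levels are drained before lower ones, anything pushed down is
picked up later in the same pass, and nothing has to be maintained during the
build apart from the insertion in get_node_buffer().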

BTW, please use List for the linked lists. No need to re-implement the
wheel.

5) Even the last page of a node buffer is now unloaded from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It
seems to me that keeping one page of each buffer in memory may consume too
much memory.

Open items I see at this moment:
1) I dislike my method of switching to buffering build because it's based
on very unproven assumptions, and so far I haven't found more reliable
assumptions in scientific papers. I would like to replace it with something much
simpler, for example switching to buffering build when the regular build
actually starts to produce a lot of IO. To implement this approach
I need to somehow detect actual IO (not just a buffer read but a miss of the OS
cache).

Yeah, that's a surprisingly hard problem. I don't much like the method
used in the patch either.

2) I'm worried about the possible size of nodeBuffersTab and path items. If
we imagine an extremely large tree with levelstep = 1, the size of these
data structures will be significant. And it's hard to predict that size
without knowing the tree size.

I'm not very worried about that in practice. If you have a very large
index, you presumably have a fair amount of memory too. Otherwise the
machine is horrendously underpowered to build or do anything useful with
the index anyway. Nevertheless it would be nice to have some figures on
that. If you have, say, an index of 1 TB in size, how much memory will
building the index need?

Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the
normal insertion code. The "loaded part" nomenclature is obsolete, as
the patch doesn't explicitly load parts of the tree into memory anymore.
Think about the names of other structs, variables and functions too,
GISTLoadedPartItem just caught my eye first but there's probably others
that could have better names.
* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler warning:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for
‘stringRelOpts[0].default_val’)

I don't think there's a way to add an entry to stringRelOpts in a way
that works correctly. That's a design flaw in the reloptions.c code that
has never come up before, as there hasn't been any string-formatted
relopts before (actually buffering option might be better served by an
enum reloption too, if we had that). Please start a new thread on that
on pgsql-hackers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#90Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#89)
Re: WIP: Fast GiST index build

On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to enable
neighbor relocation. If I remember the performance tests correctly, it
improves the quality of the resulting index, but incurs a significant CPU
overhead.

Actually, looking at the performance numbers on the wiki page again (
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011),
it looks like neighbor relocation doesn't help very much with the index
quality - sometimes it even results in a slightly worse index. Based on
those results, shouldn't we just remove it? Or is there some other data set
where it helps significantly?

Oh, actually I haven't added some of the results with neighborrelocation = off.
I would like to rerun some tests with the current version of the patch.

3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have
some performance numbers that show that it's critical for the overall
performance. But I don't think that was the case, it's just an additional
optimization for servers with big RAID arrays. So, please separate that into
an add-on patch. It needs to be performance tested and reviewed separately.

I thought that prefetch helps even on separate hard disks by ordering the IOs.

4) The final emptying algorithm was reverted to the original one. My
experiments show that the typical number of passes in final emptying in
your version of the patch is about 5, which may be significant in itself.
Also, I have no estimate of the number of passes and no guarantee that it will
not be high in some corner cases. That is, I prefer the more predictable
single-pass algorithm even though it is a little more complex.

I was trying to get rid of that complexity during index build. Some extra
code in the final pass would be easier to understand than extra work that
needs to be done through the index build. It's not a huge amount of code,
but still.

I'm not worried about the extra CPU overhead of scanning the data
structures at the final pass. I guess in my patch you had to do extra I/O as
well, because the buffers were not emptied in strict top-down order, so
let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add the
buffer to the right list at getNodeBuffer(). That way in the final stage,
you need to scan through all buffers instead of just the non-empty ones. But
the overhead of that should be minimal in practice, scanning some
in-memory data structures is pretty cheap compared to building an index.
That way you wouldn't need to maintain the lists during the index build,
except for adding each buffer to correct lists in getNodeBuffer().

BTW, please use List for the linked lists. No need to re-implement the
wheel.

Ok.

5) Even the last page of a node buffer is now unloaded from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It
seems to me that keeping one page of each buffer in memory may consume too
much memory.

Open items I see at this moment:
1) I dislike my method of switching to buffering build because it's based
on very unproven assumptions, and so far I haven't found more reliable
assumptions in scientific papers. I would like to replace it with something much
simpler, for example switching to buffering build when the regular build
actually starts to produce a lot of IO. To implement this approach
I need to somehow detect actual IO (not just a buffer read but a miss of the OS
cache).

Yeah, that's a surprisingly hard problem. I don't much like the method used
in the patch either.

Is it possible to make buffering build a user defined option until we have a
better idea?

2) I'm worried about the possible size of nodeBuffersTab and path items. If
we imagine an extremely large tree with levelstep = 1, the size of these
data structures will be significant. And it's hard to predict that size
without knowing the tree size.

I'm not very worried about that in practice. If you have a very large
index, you presumably have a fair amount of memory too. Otherwise the
machine is horrendously underpowered to build or do anything useful with the
index anyway. Nevertheless it would be nice to have some figures on that. If
you have, say, an index of 1 TB in size, how much memory will building the
index need?

I think with points it would be about 1 million buffers and about 100-300
megabytes of RAM, depending on space utilization. That may be OK, because 1 TB
is a really huge index. But if maintenance_work_mem is low we can run out of
it, though a low maintenance_work_mem would be quite strange for a system with
1 TB indexes.

Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious whitespace
in the patch.
* How about renaming GISTLoadedPartItem to something like
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the
normal insertion code. The "loaded part" nomenclature is obsolete, as the
patch doesn't explicitly load parts of the tree into memory anymore. Think
about the names of other structs, variables and functions too,
GISTLoadedPartItem just caught my eye first but there's probably others that
could have better names.
* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler warning:

reloptions.c:259: warning: initializer-string for array of chars is too
long
reloptions.c:259: warning: (near initialization for
‘stringRelOpts[0].default_val’)

I don't think there's a way to add an entry to stringRelOpts in a way that
works correctly. That's a design flaw in the reloptions.c code that has
never come up before, as there hasn't been any string-formatted relopts
before (actually buffering option might be better served by an enum
reloption too, if we had that). Please start a new thread on that on
pgsql-hackers.

Ok.

------
With best regards,
Alexander Korotkov.

#91Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#90)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

Here is the latest version of the patch.
List of changes:
1) Neighbor relocation and prefetch were removed. They will be supplied as
separate patches.
2) Final emptying now uses standard lists of all buffers by level.
3) Automatic switching again uses a simple comparison of index size and
effective_cache_size.
4) Some renames. In particular, GISTLoadedPartItem
became GISTBufferingInsertStack.
5) Some comments were corrected and some were added.
6) pgindent
7) rebased with head
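
As an illustration of item 3, the switching rule can be sketched like this.
It is an assumption about the shape of the check, not the code in the
attached patch: BuildMode and maybe_switch() are hypothetical names, and both
sizes are taken to be expressed in pages.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    BUILD_REGULAR,              /* plain tuple-by-tuple insertion */
    BUILD_BUFFERING             /* buffering build */
} BuildMode;

/*
 * Switch to buffering build once the index no longer fits into the
 * estimated cache. A build that already switched never switches back.
 */
static BuildMode
maybe_switch(BuildMode mode, uint64_t index_pages,
             uint64_t effective_cache_pages)
{
    if (mode == BUILD_REGULAR && index_pages > effective_cache_pages)
        return BUILD_BUFFERING;
    return mode;
}
```

Such a check would be called periodically during the build with the current
index size, so small indexes never pay the overhead of buffering.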

Readme update and user documentation coming soon.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.11.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.11.0.patch.gzDownload
#92Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#91)
1 attachment(s)
Re: WIP: Fast GiST index build

Manual and readme updates.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.12.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.12.0.patch.gzDownload
#93Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#91)
Re: WIP: Fast GiST index build

On 10.08.2011 13:19, Alexander Korotkov wrote:

Hi!

Here is the latest version of the patch.
List of changes:
1) Neighbor relocation and prefetch were removed. They will be supplied as
separate patches.

unloadNodeBuffers() is now dead code.

2) Final emptying now uses standard lists of all buffers by levels.
3) Automatic switching again uses a simple comparison of index size and
effective_cache_size.

LEAF_PAGES_STATS_* are unused now. We should also avoid calling smgrnblocks()
on every tuple; the overhead of that could add up.

4) Some renames. In particular, GISTLoadedPartItem was renamed
to GISTBufferingInsertStack.
5) Some comments were corrected and some were added.
6) pgindent
7) rebased with head

Readme update and user documentation coming soon.

I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage()
with the gistplacetopage() function used in the main codepath? There's
very little difference between them, and it would be nice to maintain
just one function. At the very least I think there should be a comment
in both along the lines of "NOTE: if you change this function, make sure
you update XXXX (the other function) as well!"

In gistbuild(), in the final emptying stage, there's this special-case
handling for the root block before looping through the buffers in the
buffersOnLevels lists:

    for (;;)
    {
        nodeBuffer = getNodeBuffer(gfbb, &buildstate.giststate, GIST_ROOT_BLKNO,
                                   InvalidOffsetNumber, NULL, false);
        if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
            break;
        MemoryContextSwitchTo(gfbb->context);
        gfbb->bufferEmptyingStack = lcons(nodeBuffer, gfbb->bufferEmptyingStack);
        MemoryContextSwitchTo(buildstate.tmpCtx);
        processEmptyingStack(&buildstate.giststate, &insertstate);
    }

What's the purpose of that? Wouldn't the loop through buffersOnLevels
lists take care of the root node too?

The calculations in initBuffering() desperately need comments. As does
the rest of the code too, but the heuristics in that function are
particularly hard to understand without some explanation.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#94Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
Re: WIP: Fast GiST index build

Split of an internal node works like this:

1. Gather all the existing tuples on the page, plus the new tuple being
inserted.
2. Call picksplit on the tuples, to divide them into pages.
3. Go through all tuples on the buffer associated with the page, and
divide them into buffers on the new pages. This is done by calling
the penalty function on each buffered tuple.

I wonder if it would be better for index quality to pass the buffered
tuples to picksplit in the 2nd step, so that they too can affect the
split decision. Maybe it doesn't make much difference in practice.
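
The third step can be sketched in a self-contained toy model. This is only an illustration, not code from the patch: keys are 1-D intervals, the penalty is the enlargement needed to cover a key, and the names Interval, penalty and relocate_buffered are assumptions made up for this example. The real code works on GiST tuples and calls the opclass penalty function.

```c
#include <assert.h>
#include <stddef.h>

/* Toy key: a closed 1-D interval standing in for a GiST bounding key. */
typedef struct Interval
{
    double      lo;
    double      hi;
} Interval;

/* Penalty = how much 'page' must grow to cover 'key' (0 if already covered). */
static double
penalty(Interval page, Interval key)
{
    double      lo = key.lo < page.lo ? key.lo : page.lo;
    double      hi = key.hi > page.hi ? key.hi : page.hi;

    return (hi - lo) - (page.hi - page.lo);
}

/*
 * After picksplit has produced 'npages' new pages, assign every tuple from
 * the old node buffer to the page with the lowest penalty.  target[i]
 * receives the index of the page chosen for buffered[i].
 */
static void
relocate_buffered(const Interval *pages, size_t npages,
                  const Interval *buffered, size_t nbuffered,
                  size_t *target)
{
    assert(npages > 0);

    for (size_t i = 0; i < nbuffered; i++)
    {
        size_t      best = 0;
        double      best_penalty = penalty(pages[0], buffered[i]);

        for (size_t j = 1; j < npages; j++)
        {
            double      p = penalty(pages[j], buffered[i]);

            if (p < best_penalty)
            {
                best_penalty = p;
                best = j;
            }
        }
        target[i] = best;
    }
}
```

In this model the buffered tuples never influence where the split boundary falls; they are only routed afterwards, at a cost proportional to the number of buffered tuples times the number of new pages.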

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#95Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#94)
Re: WIP: Fast GiST index build

On Thu, Aug 11, 2011 at 10:21 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Split of an internal node works like this:

1. Gather all the existing tuples on the page, plus the new tuple being
inserted.
2. Call picksplit on the tuples, to divide them into pages.
3. Go through all tuples on the buffer associated with the page, and divide
them into buffers on the new pages. This is done by calling the penalty
function on each buffered tuple.

I wonder if it would be better for index quality to pass the buffered
tuples to picksplit in the 2nd step, so that they too can affect the split
decision. Maybe it doesn't make much difference in practice.

I had this idea. But:
1) A buffer contains many more tuples than a page plus the new tuple.
2) A picksplit method can easily be quadratic, for example.

Let's look at the complexity of the picksplit algorithms:
1) geometric datatypes (point, box etc.) - O(n) (BTW, I have serious doubts
about it, i.e. an O(n*log(n)) algorithm could be several times better in many
cases)
2) pg_trgm and fts - O(n^2)
3) seg - O(n*log(n))
4) cube - O(n^2)

Thus, I believe such a feature should be optional. We can try it as an add-on
patch.
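
To put a rough number on the concern above, here is a small back-of-the-envelope sketch. The tuple counts used with it are assumptions chosen for illustration, not measurements, and quadratic_cost and slowdown_factor are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Abstract operation count of a quadratic picksplit over n tuples. */
static uint64_t
quadratic_cost(uint64_t n)
{
    return n * n;
}

/*
 * How many times more expensive a quadratic picksplit becomes when the
 * buffered tuples are passed to it together with the tuples on the page.
 */
static uint64_t
slowdown_factor(uint64_t page_tuples, uint64_t buffered_tuples)
{
    assert(page_tuples > 0);

    return quadratic_cost(page_tuples + buffered_tuples) /
        quadratic_cost(page_tuples);
}
```

For example, if a page held 100 tuples and its buffer held 9900 more, slowdown_factor(100, 9900) would be 10000 for an O(n^2) picksplit, while a linear picksplit would only become 100 times slower.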

------
With best regards,
Alexander Korotkov.

#96Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#93)
1 attachment(s)
Re: WIP: Fast GiST index build

On Wed, Aug 10, 2011 at 11:45 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

unloadNodeBuffers() is now dead code.

processEmptyingStack calls it.

LEAF_PAGES_STATS_* are unused now.

Removed.

Should avoid calling smgrnblocks() on every tuple, the overhead of that
could add up.

It is now called once every BUFFERING_MODE_SWITCH_CHECK_STEP (256) tuples.
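
A minimal sketch of that kind of throttled check, assuming a simplified build state. BuildState, expensive_index_size and maybe_switch_to_buffering are hypothetical names for this example; expensive_index_size merely stands in for the smgrnblocks() call, and comparing block counts against effective_cache_size is a simplification of the patch's actual test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFERING_MODE_SWITCH_CHECK_STEP 256

/* Hypothetical, simplified stand-in for the index build state. */
typedef struct BuildState
{
    uint64_t    tuples_inserted;    /* tuples processed so far */
    uint64_t    size_checks;        /* how often the index size was queried */
    uint64_t    index_blocks;       /* simulated current index size in blocks */
    bool        buffering;          /* true once buffering mode is enabled */
} BuildState;

/* Stand-in for the relatively expensive smgrnblocks() call. */
static uint64_t
expensive_index_size(BuildState *state)
{
    state->size_checks++;
    return state->index_blocks;
}

/*
 * Called once per inserted tuple.  The index size is only queried on every
 * BUFFERING_MODE_SWITCH_CHECK_STEP-th tuple, and never again after the
 * switch to buffering mode has happened.
 */
static void
maybe_switch_to_buffering(BuildState *state, uint64_t effective_cache_size)
{
    assert(state != NULL);

    state->tuples_inserted++;
    if (state->buffering)
        return;
    if (state->tuples_inserted % BUFFERING_MODE_SWITCH_CHECK_STEP != 0)
        return;

    if (expensive_index_size(state) > effective_cache_size)
        state->buffering = true;
}
```

With this shape the size query costs at most one call per 256 tuples, at the price of switching up to 255 tuples later than a per-tuple check would.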

I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage()
with the gistplacetopage() function used in the main codepath? There's very
little difference between them, and it would be nice to maintain just one
function. At the very least I think there should be a comment in both along
the lines of "NOTE: if you change this function, make sure you update XXXX
(the other function) as well!"

I doubt they can be effectively merged, but will try.

In gistbuild(), in the final emptying stage, there's this special-case
handling for the root block before looping through the buffers in the
buffersOnLevels lists:

    for (;;)
    {
        nodeBuffer = getNodeBuffer(gfbb, &buildstate.giststate, GIST_ROOT_BLKNO,
                                   InvalidOffsetNumber, NULL, false);
        if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
            break;
        MemoryContextSwitchTo(gfbb->context);
        gfbb->bufferEmptyingStack = lcons(nodeBuffer, gfbb->bufferEmptyingStack);
        MemoryContextSwitchTo(buildstate.tmpCtx);
        processEmptyingStack(&buildstate.giststate, &insertstate);
    }

What's the purpose of that? Wouldn't the loop through buffersOnLevels lists
take care of the root node too?

I was worried about node buffers being deleted from a list while gistbuild()
is scanning that list. That's why I avoided deleting from the lists.
Now I've added an additional check for the root node in the loop over the list.

The calculations in initBuffering() desperately need comments. As does the
rest of the code too, but the heuristics in that function are particularly
hard to understand without some explanation.

Some comments were added. I'm working on more of them.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.13.0.patch.gz (application/x-gzip)
���BNgist_fast_build-0.13.0.patch�[yw����|�
3cf1��X���F2	�	���������^HW7�$��>w��Y�{'g�M-�n�����.�VK8=/r{:�������N���M��7�*[�~�����X �����N���������7���������I���wO<w���8�-�X97����	?�R-�dO(���tn��T�*������	6G�J|yr���"k!��Q������~��VDS�;�L
��BO�I-�a5W�\8�'b���&�H�R��;�M&v�Z���0Q�d���Z$2�4���B���
��ji�����-��T���e�/�h�����2v��+����s���+`V�~'�=K��yN�h��<�P���y�"I>R��8J��

�L�.���d��G�o���5�y��:],�����H�0w*�Sby�N(��	`&���k�.��72�5)MUx�&]��x�t>�p��7|������O$����7 X��)M#�_� �(�����#��@�&=�K*G�+lAI��k���V��\�'��e��B{-e|�E[Z����-�\��y�*�#��O/R����>���=|qx��	�b�8��}����x�M�����T�
TT�����@o��(W\uP��G����������[g��y��Y�AFx��:�I������>L#P&0���w�q�Y�d���S���l9�a��ZH2f���&Vr
J�I���Z8��/Z�U�z��Us|�Yh�5��It;�8��# ��:
������+:(\��I��n��+g
6�Ks�0qThG��Os]�&����X���l��J��,|��T��G5��1:h��������"Y��Y�]�:�g�|���D�2�Y��#WjM"Mb5������v�l�e1�0��_sx0s���U$"r�4F�������6�9%�t,��Q3��*��S�M�h����O�5Y�N�-���r����e��9p�E
�����mY��R�����Q�������)/q6�;�8�b K%���o>|�r�1�j#S�A�M�4���R�:"]��R��k���`���{��_!��d�3��@��H�E,�k�r��xt����i�D�kA�|��(�8���~J:b��@���"����
t��P4e
���L����% b�?I��t���CTv��);i��@A=�l�;R�B����$L+�2M~�����Q��xB� ���H`�E	w�t�S������sg	�D�6�$��1"��'�?��
����J�H�h��o@}:��_��,>@�b���9�;��Gq.��Q>Fr�I���������O���x���
R���z�?�*f�z]l������ �D�|�'��'���zS=���Et������s�9��wj��p��F��F�G���;Q{u�*��o�<g��jp:�_������bx�v�3���x���jp1�7�"� B���u [9�g�����G=b�F8����B~��`W������O��F	�;Y�\TF@�����������!��G���J�O`{�u�0R"�E�EzJ�)t�%r���pP����=_�Z��gg���_XB��p��V����?�M,����L���}3�,����K$����I�������������_pX��
��X��kB���9M�[o=���Aa�{���hU���v�J���������b��w7;+�m
���������`�i�'^#L�E����������10n��%z^�}.�lN;�?���*y��NZX5<~��n��m�o��b0�s*���SAd�������E��a����%8w�i��7(N���$���$
!�A2��-��]7aJ1Ka
 m0�l�t�h!ta	�\j��[��$1I�g�(@�`Tz���$���0����m�P^:��2P�,ds�-�`}������"d�������qQE{������(�~�`�������1�i{�Ep����
����W�:�\_��(��;�W��P\���{]���h�W��eo/���;����T6_a7!z-���
2�
�b6!���gC�*'�v��??`Pb��~g�'W����/��4����v����u�V��.?�'?��L�
N?�?D�h|yz=��_���o��m�������j���y���)���M���*aZ0��l�y�����XxB,���0����M�@wh0'!�>��XA0# �^K�}�2�Ej�����/>�F_����W����?l��U.'x�����{�/�{��G����q���G��������6 ����1m)����|���c��Z��{��:n����������CoB�Z2����x�$b�`Q�g6�����C���Qz-���D�f�HG��"�@C�B��I��?Ak����(����^���=���� ���(���}:�c�g�,&A��wV ��[�oqL��TAME�[*�A+��w�#7X4��M�4��G�u���Jb��U�����?k��.���%��3�����~���3�l�]�������n�N�����ew���C��s+������R������2O�0�����F`��Ql��A����8>k������
�;E����<�����q�40_f�6��$���7�
��R���xr��x�2w�S����np�&B�6���uj���%�g{��f~�6
z�~�]2�*��A{�8/<�wS��`�/��Y�a��^r�8h�?�OZ�C]*<�F�����"x�)8D��
�I� 
��M�`N�k;�X*G|:�����nk��/�c`��L,4Y7'�o��l+G�_�����*A��1z���*��dy��_���s.Cs�+��;�Y\�����c�����7G?��� �C���1�w�}�_����_��V��vT7�iW���
���-x2f��e�����R���������e�w��;��A�N1��i5��VYvCF�vG��#�O<�g����T�w*��;���=lU~r�0�y�cB��=l�b��n�$�'��yg����M`v)f/�j���@@1G��\b���A���b-�x�m��%�&������9��E�wL���[�7]>����%]�������w���@8���_�(��p���������w�zW\G|xMW�^�V���*��g������lrv9]~�\
�����	�S��	����{b�)����h-�;z���@y�O~���
h�<���hx�������~E{��S�b��8$���!(�w���&M��s�-����;��d�
�"�Q����d��O�:6��H�ya :GL5K�RP8x3�v�xA�X���'����N���	�U��q%�J�
o�*^c>^����jz��	���g>?C���n ����%���Ij�{�l���������������|��l�@u'W��Z~�^��:r=���#*s��z���[f�:x*T`G�*�P~�MjT�aWvM{PP���1�&-%J����cM�s����rY0��]Q�~�r�Q&��T��F{�h�*n�v�t�q�5`E���2����n������Ve�v�&'_51���?a�_5�����V��V��8���P.��#]s��9�Y�*2�%|J�J7b��#G'���m��q+x�&�(]2��uW\�u�:��;W
�����8���8�c^N1�D��C�����>rY��_BK�<�m
�e��;(2����0O�[:\(������N�>�b�V��ub���S���~�m�7_�E�([��!*���a�6���)^��<:.8:#$s$���b�):����)���`������ERc��"��(�7/�"�tIG�E���4�8��� �3������0���:L���|��m��'�c��np��5Vfy�����dK�`�''�&K�jxp/9�&AC�NAj�\�p��Ly�5�Mb���`m"�T���V������h��2�3B�����t*��|��aC��C�M�����;���d�O���bhU
��b�/��W#�h���Yl2R=%Tt#3f����)����-9�H#�}7)���P���=R.L���=(x.�Q7��T'qv�CG�&�,���l�;�l��i��k8/d
V��Y��������8��U��@|
\���D�L~'x��Y�\�4d��o�6�H�	�?��
���g��R�%��������J���(���XDI�����o]��	�7;<l!�tC�*�J�U1B�{K	�x5��f����K9	C+s����~K��2�]�|ED�I�Js�3O����'�`�����i����<hcY�L����P8.�Q\!��j��K��v%^��f��}r0�e �"�k�7����p.S����B���Z�,-"�����+;�;��Y�c��3��L��g�h		q��&�q)���r��c��r���9	���j�ud�(�X��[����5u(�cc�L/J��"� �"��������i�<��w���G%�B�������FL��d�c?#b���G��2=h*��(�XU
f����.�����J����lP����f�o`#O�R���Vh���K��=KvgN�������y�j�zL���Z(R�b*�����1���2��\��#����������8�'����6��$�b��7�|�Od�tD�%>E�m�[������0	&���
������pON��C�����w������F�bH�����ja��9Ik*��\p
�=^E*}O��RX�@�����+��O�"���)�E��Q��&�OG��� �������$��t��1Lb0�[{�|�]%��~�����"�|"���1��4z��������I�G��DHF�p��?�B�1�o\I�A	��������nx���9���u^Li-����7Ji-���t�W;-Z���
.���[�`l:��e�T��o�{<�Y�4S"��G;��p��hQ�lHc�^&�EJ�>)�d��g��ZV�v@1��9O�����.o�����St�.q���9�0&��7)��hH
�K�g�tIbM�u������~��x�;���.�aZ?���K��Y#B�����)�(1���t�N�)d��Y����*\R0�VH���u��>3=��+{�W�9h�8���n�|r>C������xp�xJ:���X��N}�D��f<�������{���������VL�%��������3w��tk5ZT���VCtB^���L��	C��nb�Q���u��
Ot�N�646"��{\�������Ia]l�b��c��+��W|����=r��\��2��{���k�Y��������&|)5���\��)9>X����D��[7Q�>^8�44s��y�mN�B)l^C�dN��(�{7�aR��I�G�������A�nr!W�|��L�T���
������Mq|L.eruy9���pq�}i�c��EM��,������7W���z@��������pehw6
N��V����[�Uq����� P��td%��d���G��J����]$���s%D�kj���[��`m�F���r�$�&�c9��>��c����������f�lE�!����
���L������J����m�)q2|����s����JHn�GD�0@0@&����\[0�r�x44���=V��c��V_�
���&:��,�1z4r��r�d$�pi��&���;����Hfn�'V�3K)�:�V�1;'JM@���f���5�}���-G���[3�`���'q���t��Z������������6�-���O���`	���0`�	.�$�f�����Z2��������]����>s3��Y������]����1����v.�i�A��iF������3�� F�;��)N.��NM�&4g��v��6�uj�-���_;i�f���S�)�i������O���AzR��!�	��`l� ��a����P+J[F���j��J���w]�7�m��[8��^V_�T;v���U���]�J��j?�;�prh�v���5)�8���jH�W7��<:�v�o*�k�NI&�n�`�����I-N�o��Q�
3gR���ez������P���* NC���_Y���� �����t��*=f	���W���W8^M���BI�Q�6���	%�RJ��4R�yq1�.!�_5��L��o�n�_e6�WY;��%��t?�
eH��&J�t2��R�7FA84����.���|��B/7��X���..���������@2`��F
������1&�M��k�f���n��CkC#���~MG%� ��u�6e��#�����,JJ����`���'|MV`�/��I���FndL�N�g�f���?�1_�fJf�Z�tb��J���@��(��0�SP��0v��l88���?8x��svt���6N��p����]�J6�'�6:��j��i�������hX����f+C�(�dt�&�el`��k�����M7/��/�����_<k?~�A:����j�hf�������$�MQ�WuZ�������!��C���$T=D%��Y���7��C8(tR�&�%�[��i��IL�+�CA���p��#!�
�d�sH��7f�����l?�x���'H��2�����p�q0�mXK1�q�L\C`��Zd5�6�
B,�l3+�>0�4@�������ZJ0�6�&c��
�!$��!�#�qG$F�����h�C�����'��������u;q�|���h��������� �4�����^���s�?�����om<Yi���QY�k ���99�������p����0&�X�{YRk���1Cf���&e���F����{�HYO
e=u���Y���u���/��?VD��������l&	�{;9BF��-K|8�a:x�#?��LA��@^=#�P���pdPX�Z/�AXJ�����b�D��rI��)�������~��6�d{H{����s�M��u�&�t��N�O���-Ay�j���/�K����L�.�Qv)�0���9V�E7�&�%M>�~�?k�x�x��l]a=m/�C�5z�:�b����gx�����C��j?�x5�CE�"�,\M���w�$�Zd��6���+�m^+��@Y��e#��0#����,�w�rw�{����r�{�!�w�A�~�I�6x������np�o�����o��aH����;�����������b@}��42�(;G�(����)������������u���>?Mx�6eI���Qa��a��p����X�����)���D~#I]��  �9�k�7���`D��btC<��m%��|����F[�y��D�d�2������y]=b���)�9"�N0�f�hf��<U#���;<����;-�]d�l�v ���'H#D'`,�{$����=�>3�����g��M��ct��G�!�yXr8�6W9^��xi$���.��O�Ix/��4~u1�{��������2�n5����=. 9e8?x�d/3�	�����a�����5���8Z�m��������?�����������no���%��=�C�m��J`���X���X���S/�$�#��4��+4"	�h0�������d
uh�
��;�����H��8���������`w�{����D�����yo�3���}C�W���/
���G��lD<r��6���7q�;+���
�
#97Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
Re: WIP: Fast GiST index build

On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.

Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users?
They've been useful during testing, but I'm thinking we should be able
to guess good enough values for them automatically, and just remove the
options. It's pretty much impossible for a user to tune them correctly,
it would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should
still process the first 10000 or so tuples with simple inserts. That way
we always have a sample of tuples to calculate the average tuple size
from. It's plausible that if the input data is ordered, looking at the
first N tuples will give a skewed sample, but I don't think there's much
danger of that in practice. Even if the data is ordered, the length of
GiST tuples shouldn't vary much.
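
The sampling idea in the paragraph above can be sketched in C roughly as
follows. This is only an illustration, not code from the patch: BuildSample,
sample_tuple(), average_tuple_size() and the SAMPLE_TUPLES constant are
hypothetical names, and the threshold simply mirrors the "10000 or so"
mentioned in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold, taken from the "first 10000 or so tuples" above. */
#define SAMPLE_TUPLES 10000

/* Hypothetical bookkeeping for the start of an index build. */
typedef struct
{
    size_t tuples_seen;  /* tuples handled by simple inserts so far */
    size_t bytes_seen;   /* total size of those tuples */
    bool   buffering;    /* has the build switched to buffering mode? */
} BuildSample;

/*
 * Account for one tuple. Returns true while the tuple should still go
 * through a simple insert, false once buffering mode is active.
 */
static bool
sample_tuple(BuildSample *s, size_t tuple_size)
{
    if (s->buffering)
        return false;

    s->tuples_seen++;
    s->bytes_seen += tuple_size;

    /* Enough tuples sampled: later tuples use the buffering path. */
    if (s->tuples_seen >= SAMPLE_TUPLES)
        s->buffering = true;
    return true;
}

/* Average tuple size over the sample, or 0 if nothing was sampled yet. */
static size_t
average_tuple_size(const BuildSample *s)
{
    return s->tuples_seen ? s->bytes_seen / s->tuples_seen : 0;
}
```

The average collected this way would then feed the buffer size estimate, even
when buffering was requested explicitly.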

What happens if we get the levelstep and pagesPerBuffer estimates wrong?
How sensitive is the algorithm to that? Or will we run out of memory?
Would it be feasible to adjust those in the middle of the index build,
if we, e.g., greatly exceed the estimated memory usage?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#98Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
2 attachment(s)
Re: WIP: Fast GiST index build

On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.

I went through these, and did some editing and rewording. Attached is an
updated README, and an updated patch of the doc changes. Let me know if
I screwed up something.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-bufferbuild-docupdates-edited.patchtext/x-diff; name=gist-bufferbuild-docupdates-edited.patchDownload
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 78171cf..244a2e2 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -642,6 +642,35 @@ my_distance(PG_FUNCTION_ARGS)
 
   </variablelist>
 
+ <sect2 id="gist-buffering-build">
+  <title>GiST buffering build</title>
+  <para>
+   Building large GiST indexes that don't fit in cache by simply inserting
+   all the tuples tends to be slow, because if the index tuples are scattered
+   across the index, a large fraction of the insertions need to perform
+   I/O. The exception is well-ordered datasets, where the part of the index
+   where new insertions go to stays well cached.
+   PostgreSQL from version 9.2 supports a more efficient method to build
+   GiST indexes based on buffering, which can dramatically reduce the number
+   of random I/Os needed.
+  </para>
+
+  <para>
+   However, a buffering index build needs to call the <function>penalty</>
+   function more often, which consumes some extra CPU resources. Also, it can
+   influence the quality of the produced index, in both positive and negative
+   directions. That influence depends on various factors, like the
+   distribution of the input data and the operator class implementation.
+  </para>
+
+  <para>
+   By default, the index build switches to the buffering method when the
+   index size reaches <xref linkend="guc-effective-cache-size">. It can
+   be manually turned on or off by the <literal>BUFFERING</literal> parameter
+   to the CREATE INDEX command.
+  </para>
+
+ </sect2>
 </sect1>
 
 <sect1 id="gist-examples">
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 1a1e8d6..1b2969e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -341,6 +341,52 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
    </varlistentry>
 
    </variablelist>
+   <para>
+    GiST indexes accept the following parameters:
+   </para>
+
+   <variablelist>
+
+   <varlistentry>
+    <term><literal>BUFFERING</></term>
+    <listitem>
+    <para>
+     Determines whether the buffering build technique described in
+     <xref linkend="gist-buffering-build"> is used to build the index. With
+     <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+     with <literal>AUTO</> (default) it is initially disabled, but turned on
+     on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">.
+    </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>LEVELSTEP</></term>
+    <listitem>
+    <para>
+     In a buffering build, buffers are located at tree levels i * <literal>LEVELSTEP</>,
+     i > 0 (we use upward level numbering, level = 0 corresponds to leaf pages).
+     By default <literal>LEVELSTEP</> is calculated so that a sub-tree
+     of <literal>LEVELSTEP</> height fits in <xref linkend="guc-effective-cache-size">
+     and <xref linkend="guc-maintenance-work-mem">.
+    </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>BUFFERSIZE</></term>
+    <listitem>
+    <para>
+     Maximum size of a node buffer, in pages. By default it is calculated so that
+     emptying half of a node buffer fills on average one page per underlying node
+     buffer. This ratio guarantees effective I/O usage. In some cases a lower
+     <literal>BUFFERSIZE</> can give comparable I/O savings with less CPU
+     overhead.
+    </para>
+    </listitem>
+   </varlistentry>
+
+   </variablelist>
   </refsect2>
 
   <refsect2 id="SQL-CREATEINDEX-CONCURRENTLY">
READMEtext/plain; name=READMEDownload
#99Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#97)
Re: WIP: Fast GiST index build

On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.

Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users?
They've been useful during testing, but I'm thinking we should be able to
guess good enough values for them automatically, and just remove the
options. It's pretty much impossible for a user to tune them correctly, it
would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should
still process the first 10000 or so tuples with simple inserts. That way we
always have a sample of tuples to calculate the average tuple size from.
It's plausible that if the input data is ordered, looking at the first N
tuples will give a skewed sample, but I don't think there's much danger of
that in practice. Even if the data is ordered, the length of GiST tuples
shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong?
How sensitive is the algorithm to that? Or will we run out of memory? Would
it be feasible to adjust those in the middle of the index build, if we, e.g.,
greatly exceed the estimated memory usage?

I see the following risks.

For levelstep:
Too small: less I/O benefit than is possible.
Too large:
1) If the subtree doesn't fit in effective_cache_size, much more I/O than
there should be (because of cache misses during buffer emptying).
2) If the last pages of the buffers don't fit in maintenance_work_mem, a
possible OOM.

For buffersize:
Too small: less I/O benefit, because the buffer size is relatively small in
comparison with the sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) than is
needed for the same I/O benefit.

Therefore I propose the following.
1) A too-large levelstep is the greatest risk. Let's use a pessimistic
estimate for it. The pessimistic estimate has the following logic:
largest sub-tree => maximal tuples per page => minimal tuple size
So by always using the minimal tuple size in the levelstep calculation we
exclude the greatest risks.
2) The risks of buffersize are comparable and not too critical. That's why I
propose to use the size of the first 10000 tuples for the estimate.
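
The pessimistic levelstep estimate proposed in item 1 could be sketched in C
as below. This is a sketch under stated assumptions and not the calculation
from the patch: the function name pessimistic_levelstep() and its parameters
are hypothetical, both limits are assumed to be given in pages, the subtree
size is modelled as the geometric sum 1 + f + ... + f^step with the maximal
fanout f = page_size / min_tuple_size, and the number of lowest-level pages
f^step is what is checked against maintenance_work_mem.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest levelstep for which a subtree built with the maximal possible
 * fanout still fits both limits. Returns 0 if not even one level fits.
 */
static int
pessimistic_levelstep(uint64_t page_size, uint64_t min_tuple_size,
                      uint64_t cache_pages, uint64_t workmem_pages)
{
    uint64_t fanout;
    uint64_t lowest = 1;   /* pages on the subtree's lowest level: fanout^step */
    uint64_t subtree = 1;  /* all pages of the subtree, root included */
    int      step = 0;

    if (min_tuple_size == 0 || page_size / min_tuple_size < 2)
        return 0;
    /* Minimal tuple size gives maximal tuples per page. */
    fanout = page_size / min_tuple_size;

    for (;;)
    {
        uint64_t next_lowest;

        if (lowest > UINT64_MAX / fanout)
            break;                      /* multiplication would overflow */
        next_lowest = lowest * fanout;

        /* Last pages of the lowest-level buffers must fit in memory. */
        if (next_lowest > workmem_pages)
            break;
        /* The whole subtree must fit in the cache. */
        if (cache_pages < subtree || next_lowest > cache_pages - subtree)
            break;

        lowest = next_lowest;
        subtree += next_lowest;
        step++;
    }
    return step;
}
```

For example, with 8192-byte pages and a 32-byte minimal tuple (fanout 256), a
cache of 524288 pages and 16384 pages of work memory give a levelstep of 1,
because a second level would need 65536 lowest-level pages.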

------
With best regards,
Alexander Korotkov.

#100Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#99)
Re: WIP: Fast GiST index build

On 11.08.2011 23:30, Alexander Korotkov wrote:

On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

On 10.08.2011 22:44, Alexander Korotkov wrote:

Manual and readme updates.

Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users?
They've been useful during testing, but I'm thinking we should be able to
guess good enough values for them automatically, and just remove the
options. It's pretty much impossible for a user to tune them correctly, it
would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should
still process the first 10000 or so tuples with simple inserts. That way we
always have a sample of tuples to calculate the average tuple size from.
It's plausible that if the input data is ordered, looking at the first N
tuples will give a skewed sample, but I don't think there's much danger of
that in practice. Even if the data is ordered, the length of GiST tuples
shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong?
How sensitive is the algorithm to that? Or will we run out of memory? Would
it be feasible to adjust those in the middle of the index build, if we, e.g.,
greatly exceed the estimated memory usage?

I see the following risks.

For levelstep:
Too small: less I/O benefit than is possible.
Too large:
1) If the subtree doesn't fit in effective_cache_size, much more I/O than
there should be (because of cache misses during buffer emptying).
2) If the last pages of the buffers don't fit in maintenance_work_mem, a
possible OOM.

Hmm, we could avoid running out of memory if we used an LRU cache
replacement policy on the buffer pages, instead of explicitly unloading
the buffers. 1) would still apply, though.

For buffersize:
Too small: less I/O benefit, because the buffer size is relatively small in
comparison with the sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) than is
needed for the same I/O benefit.

Therefore I propose the following.
1) A too-large levelstep is the greatest risk. Let's use a pessimistic
estimate for it. The pessimistic estimate has the following logic:
largest sub-tree => maximal tuples per page => minimal tuple size
So by always using the minimal tuple size in the levelstep calculation we
exclude the greatest risks.
2) The risks of buffersize are comparable and not too critical. That's why I
propose to use the size of the first 10000 tuples for the estimate.

Yep, sounds reasonable.

I think it would also be fairly simple to decrease levelstep and/or
adjust buffersize on-the-fly. The trick would be in figuring out the
heuristics on when to do that.

Another thing occurred to me while looking at the buffer emptying
process: At the moment, we stop emptying after we've flushed 1/2 buffer
size worth of tuples. The point of that is to avoid overfilling a
lower-level buffer, in the case that the tuples we emptied all landed on
the same lower-level buffer. Wouldn't it be fairly simple to detect that
case explicitly, and stop the emptying process only if one of the
lower-level buffers really fills up? That should be more efficient, as
you would have to "swap" between different subtrees less often.
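
The emptying rule suggested above (stop only when a lower-level buffer
really fills up, instead of after half a buffer) might look like this
in C. It is a simplified model, not the patch's code: ChildBuffer and
empty_until_child_full() are hypothetical names, and the array route[]
stands in for the penalty-based choice of the child each tuple lands in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a lower-level node buffer. */
typedef struct
{
    size_t ntuples;   /* tuples currently in the buffer */
    size_t capacity;  /* the buffer counts as full at this many tuples */
} ChildBuffer;

/*
 * Move tuples out of a parent buffer. route[i] is the index of the child
 * buffer that the i-th parent tuple goes to. Emptying stops when the parent
 * is drained or as soon as one child buffer becomes full.
 * Returns the number of tuples moved.
 */
static size_t
empty_until_child_full(const int *route, size_t nparent,
                       ChildBuffer *children)
{
    size_t moved = 0;

    while (moved < nparent)
    {
        ChildBuffer *child = &children[route[moved]];

        child->ntuples++;
        moved++;

        /* This child has to be emptied itself before we can go on. */
        if (child->ntuples >= child->capacity)
            break;
    }
    return moved;
}
```

With eight parent tuples spread evenly over two children of capacity four,
this moves seven tuples before stopping, where a fixed half-buffer rule would
stop after four. If all tuples land in the same child, it stops after four,
which is the case the half-buffer rule was meant to protect against.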

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#101Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#100)
Re: WIP: Fast GiST index build

On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

I think it would also be fairly simple to decrease levelstep and/or adjust
buffersize on-the-fly. The trick would be in figuring out the heuristics on
when to do that.

It would be simple to decrease levelstep to one of its divisors. It seems
quite hard to decrease it, for example, from 3 to 2. Also, it's pretty hard
to detect that a sub-tree actually doesn't fit in the cache. I don't see much
difficulty in runtime tuning of buffersize.
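
One way to read the divisor remark is that buffers live on levels
i * levelstep, so lowering levelstep to a divisor keeps every existing
buffer on a level that still carries buffers. The following C sketch checks
that property. It is an interpretation, not code from the patch, and the
names level_has_buffers() and existing_buffers_stay_valid() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Buffers sit on levels i * levelstep, i > 0 (level 0 is the leaf level). */
static bool
level_has_buffers(int level, int levelstep)
{
    assert(levelstep > 0);
    return level > 0 && level % levelstep == 0;
}

/*
 * True if every level that has buffers under old_step would still have
 * buffers under new_step, so no existing buffer has to be relocated.
 */
static bool
existing_buffers_stay_valid(int old_step, int new_step, int tree_height)
{
    for (int level = 1; level <= tree_height; level++)
    {
        if (level_has_buffers(level, old_step) &&
            !level_has_buffers(level, new_step))
            return false;
    }
    return true;
}
```

Going from 6 to 3 keeps levels 6 and 12 valid, while going from 3 to 2 leaves
the buffers on level 3 on a level that no longer has any.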

Another thing occurred to me while looking at the buffer emptying process:
At the moment, we stop emptying after we've flushed 1/2 buffer size worth of
tuples. The point of that is to avoid overfilling a lower-level buffer, in
the case that the tuples we emptied all landed on the same lower-level
buffer. Wouldn't it be fairly simple to detect that case explicitly, and
stop the emptying process only if one of the lower-level buffers really
fills up? That should be more efficient, as you would have to "swap" between
different subtrees less often.

Yes, it seems reasonable to me.

------
With best regards,
Alexander Korotkov.

#102Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#96)
Re: WIP: Fast GiST index build

On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

[ new patch ]

Some random comments:

- It appears that the "noFollowFight" flag is really supposed to be
called "noFollowRight".

- In gist_private.h you've written "halt-filled" where you really mean
"half-filled".

- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time. IIUC, "BUFFERING",
"LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
they're just used ephemerally while constructing it. If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.

- Function names should begin with "gist" or some other, appropriate
prefix, especially if they are non-static. decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), and
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.

- validateBufferOption needs to use ereport(), not elog().

- This needs a bit of attention:

+               /* TODO: Write the WAL record */
+               if (RelationNeedsWAL(state->r))
+                       recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
+                                              dist, oldrlink, oldnsn, InvalidBuffer, true);
+               else
+                       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert(). Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit(). Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.
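
The point about moving the test into the logging function can be shown with
a small self-contained C model. None of the names below are PostgreSQL APIs:
Rel, RecPtr and log_split() are hypothetical stand-ins, and the counters
only exist so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t RecPtr;                 /* stand-in for a WAL record pointer */
typedef struct { bool needs_wal; } Rel;  /* stand-in for a relation */

static RecPtr next_wal_ptr = 100;   /* next "real" WAL position */
static RecPtr next_fake_ptr = 1;    /* next fake position for non-WAL rels */
static int    wal_records_written = 0;

/*
 * The needs-WAL decision is made once, inside the logging routine, so
 * callers never have to repeat the if/else themselves.
 */
static RecPtr
log_split(const Rel *rel)
{
    if (!rel->needs_wal)
        return next_fake_ptr++;     /* nothing is written to the log */

    wal_records_written++;
    return next_wal_ptr++;
}
```

Every caller then reduces to a single call such as recptr = log_split(rel),
whether or not the relation is WAL-logged.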

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+ for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers. The second loop sets up
downlink pointers. Then there's some other code. Then there's a
third loop, which adds items to each page in turn and sets up right
links. Then there's a fourth loop, which marks all those buffers
dirty. Then you write XLOG. Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list. All of this
seems like it could be simplified. In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page. And the fifth and
sixth loops can also be merged. You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.
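
The two merges proposed above can be sketched on a toy linked list in C. This
is a model for illustration only: SplitPage and both functions are
hypothetical, and the boolean fields stand in for marking a buffer dirty and
for holding its lock and pin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one page produced by a split. */
typedef struct SplitPage
{
    struct SplitPage *next;
    int      nitems;  /* items placed on the page */
    bool     dirty;   /* page was modified and must be written out */
    bool     locked;  /* buffer lock and pin are still held */
    uint64_t lsn;     /* LSN of the WAL record covering the split */
} SplitPage;

/* Third and fourth loop merged: add the items and set the dirty bit. */
static void
fill_and_dirty(SplitPage *dist, int items_per_page)
{
    for (SplitPage *ptr = dist; ptr; ptr = ptr->next)
    {
        ptr->nitems += items_per_page;
        ptr->dirty = true;
    }
}

/*
 * Fifth and sixth loop merged: stamp the LSN on a page and release it
 * right away. Returns the number of pages released.
 */
static int
stamp_and_release(SplitPage *dist, uint64_t lsn)
{
    int released = 0;

    for (SplitPage *ptr = dist; ptr; ptr = ptr->next)
    {
        ptr->lsn = lsn;
        ptr->locked = false;
        released++;
    }
    return released;
}
```

The WAL record would be written between the two calls, so each page is
visited twice after allocation instead of four times.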

I'm also wondering how long this linked list can be. It's not good to
have large numbers of buffers locked for a long period of time. At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards. The overall flow of the function is:

if (is_split)
{
    /* complicated stuff */
}
else
{
    /* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
    /* simple case */
    return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage(). It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.

- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index. Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes. Some kind of comment seems like it
would be a good idea.

- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them. If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#103Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#102)
1 attachment(s)
Re: WIP: Fast GiST index build

Hi!

Thank you for your notes.

On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

[ new patch ]

Some random comments:

- It appears that the "noFollowFight" flag is really supposed to be
called "noFollowRight".

Fixed.

- In gist_private.h you've written "halt-filled" where you really mean
"half-filled".

Fixed.

- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time. IIUC, "BUFFERING",
"LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
they're just used ephemerally while constructing it. If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.

Having these as index parameters may be helpful when you reindex. It's
likely that you would like to rebuild it with same parameters as it was
created. Actually, we have the same situation with FILLFACTOR: it is used
only during index creation.

- Function names should begin with "gist" or some other, appropriate
prefix, especially if they are non-static. decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), and
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.

Fixed.

- validateBufferOption needs to use ereport(), not elog().

Fixed.

- This needs a bit of attention:

+               /* TODO: Write the WAL record */
+               if (RelationNeedsWAL(state->r))
+                       recptr = gistXLogSplit(state->r->rd_node,
blkno, is_leaf,
+                                                               dist,
oldrlink, oldnsn, InvalidBuffer, true);
+               else
+                       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert(). Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit(). Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like
this:

+ for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers. The second loop sets up
downlink pointers. Then there's some other code. Then there's a
third loop, which adds items to each page in turn and sets up right
links. Then there's a fourth loop, which marks all those buffers
dirty. Then you write XLOG. Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list. All of this
seems like it could be simplified. In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page. And the fifth and
sixth loops can also be merged. You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be. It's not good to
have large numbers of buffers locked for a long period of time. At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards. The overall flow of the function is:

if (is_split)
{
/* complicated stuff */
}
else
{
/* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
/* simple case */
return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage(). It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.

So far, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage().
I'm now trying some further refactoring.

- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index. Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes. Some kind of comment seems like it
would be a good idea.

Corresponding comment was added.

- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them. If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?

The idea of having buffers on levels with some step is the following. We have
enough cache for a sub-tree of some height to fit in it. Once such a sub-tree
is loaded, we can process index tuples inside it efficiently (without actual
I/O). During buffer emptying we flush index tuples to the underlying buffers
or leaf pages. Having buffers on levels with a step guarantees that flushing
doesn't require loading (and writing) more than one such sub-tree (which fits
in cache). Thus, if we process enough index tuples during emptying, it is
I/O-efficient. It's possible that some more effective distribution of buffers
exists, but it's currently unclear to me.
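
A minimal sketch of the level-step rule, assuming leaf pages are level 0 and
that internal levels which are multiples of the level step carry buffers. The
helper names level_has_buffers() and flush_target_level() are hypothetical and
not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does a page on this level carry a buffer?
 * Level 0 is the leaf level, which never has buffers; internal levels
 * that are multiples of level_step do.
 */
bool
level_has_buffers(int level, int level_step)
{
	assert(level >= 0 && level_step >= 1);
	return level > 0 && level % level_step == 0;
}

/*
 * Hypothetical helper: the level where tuples flushed from a buffer on
 * buffered_level come to rest. That is the next buffered level below, or
 * the leaf level (0) when no buffered level lies in between.
 */
int
flush_target_level(int buffered_level, int level_step)
{
	assert(level_has_buffers(buffered_level, level_step));
	return buffered_level - level_step;
}
```

With level_step = 2, a buffer on level 4 flushes into buffers on level 2, and
a buffer on level 2 flushes into leaf pages. Either way a flush only walks
through level_step levels, which is the sub-tree that is supposed to fit in
cache.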

Other changes:
1) Levelstep and buffersize user options were removed.
2) Buffer size is now run time tuned.
3) Buffer emptying now stops when some child can't take any more index tuples.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.14.0.patch.gz (application/x-gzip)
#104Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#103)
1 attachment(s)
Re: WIP: Fast GiST index build

I found that I forgot to remove levelstep and buffersize from reloptions.c.
Updated patch is attached.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.14.1.patch.gz (application/x-gzip)
#105Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#104)
Re: WIP: Fast GiST index build

Looking at the calculation of levelStep:

+ 	/*
+ 	 * Calculate levelStep by available amount of memory. We should be able to
+ 	 * load into main memory one page of each underlying node buffer (which
+ 	 * are in levelStep below). That give constraint over
+ 	 * maintenance_work_mem. Also we should be able to have subtree of
+ 	 * levelStep level in cache. That give constraint over
+ 	 * effective_cache_size.
+ 	 *
+ 	 * i'th underlying level of sub-tree can consists of
+ 	 * i^maxIndexTuplesPerPage pages at maximum. So, subtree of levelStep
+ 	 * levels can't be greater then 2 * maxIndexTuplesPerPage ^ levelStep
+ 	 * pages. We use some more reserve due to we probably can't take whole
+ 	 * effective cache and use formula 4 * maxIndexTuplesPerPage ^ levelStep =
+ 	 * effectiveCache. We use similar logic with maintenance_work_mem. We
+ 	 * should be able to store at least last pages of all buffers where we are
+ 	 * emptying current buffer to.
+ 	 */
+ 	effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
+ 						  effective_cache_size);
+ 	levelStep = (int) log((double) effectiveMemory / 4.0) /
+ 		log((double) maxIndexTuplesPerPage);
+

I can see that that's equal to the formula given in the paper,
log_B(M/4B), but I couldn't see any explanation for that formula in the
paper. Your explanation makes sense, but where did it come from?

It seems a bit pessimistic: while it's true that a subtree can't be
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter
upper bound on it. The max size of a subtree of depth n can be
calculated as the geometric series:

r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but
for a large r and small n (which is typical), the 2 *
maxIndexTuplesPerPage^levelStep formula gives a value that's almost
twice as large as the real max size of a subtree.
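
To make the gap concrete, here is a small sketch in C that computes both
bounds. The function names and the "fanout" parameter (standing in for
maxIndexTuplesPerPage) are made up for illustration and are not from the patch:

```c
#include <assert.h>

/*
 * Exact maximum number of pages in a subtree of the given depth:
 * fanout^0 + fanout^1 + ... + fanout^depth.
 * "fanout" stands in for maxIndexTuplesPerPage.
 */
double
subtree_max_pages(int fanout, int depth)
{
	double		total = 0.0;
	double		level_pages = 1.0;	/* fanout^0: the subtree root */
	int			i;

	assert(fanout >= 2 && depth >= 0);

	for (i = 0; i <= depth; i++)
	{
		total += level_pages;
		level_pages *= fanout;
	}
	return total;
}

/* The looser bound from the patch comment: 2 * fanout^depth. */
double
subtree_loose_bound(int fanout, int depth)
{
	double		power = 1.0;
	int			i;

	assert(fanout >= 2 && depth >= 0);

	for (i = 0; i < depth; i++)
		power *= fanout;
	return 2.0 * power;
}
```

For fanout = 100 and depth = 2 the exact sum is 10101 pages, while the loose
bound gives 20000. For fanout = 2 and depth = 3 the two are 15 and 16.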

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#106Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#105)
Re: WIP: Fast GiST index build

On Tue, Aug 16, 2011 at 4:04 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

I can see that that's equal to the formula given in the paper, log_B(M/4B),
but I couldn't see any explanation for that formula in the paper. Your
explanation makes sense, but where did it come from?

I didn't find it either. But it has to reserve memory for both the sub-tree
and the active buffers, whereas we reserve memory for the sub-tree in
effective_cache_size and memory for the last pages of buffers in
maintenance_work_mem.

It seems a bit pessimistic: while it's true that a subtree can't be
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter
upper bound on it. The max size of a subtree of depth n can be calculated as
the geometric series:

r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but for
a large r and small n (which is typical), the 2 * maxIndexTuplesPerPage^levelStep
formula gives a value that's almost twice as large as the real max size of a
subtree.

Thus, we can calculate:
levelstep = min(log_r(1 + effective_cache_size_in_pages*(r - 1)) - 1,
log_r(maintenance_work_mem_in_pages - 1))
and get a more precise result. But we also need at least a very rough estimate
of the memory occupied by the node buffers hash table and path items.
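
A rough sketch of that calculation in C, written with integer-style loops
instead of logarithms so that it yields the floor of the formula above without
rounding surprises. The function name calc_level_step() and its parameters are
assumptions for illustration; the memory used by the hash table and path items
is not accounted for here:

```c
#include <assert.h>

/*
 * Largest levelStep (floored at 0) such that
 *   (a) 1 + r + ... + r^levelStep <= cache_pages, and
 *   (b) r^levelStep <= work_mem_pages - 1.
 * This is the floor of
 *   min(log_r(1 + cache_pages * (r - 1)) - 1, log_r(work_mem_pages - 1)).
 * r stands in for maxIndexTuplesPerPage; both sizes are in pages.
 */
int
calc_level_step(double r, double cache_pages, double work_mem_pages)
{
	int			step = 0;
	double		level_pages = 1.0;		/* r^step */
	double		subtree_pages = 1.0;	/* 1 + r + ... + r^step */

	assert(r >= 2.0);

	for (;;)
	{
		double		next_level = level_pages * r;
		double		next_subtree = subtree_pages + next_level;

		/* Stop as soon as one more level would violate either limit. */
		if (next_subtree > cache_pages ||
			next_level > work_mem_pages - 1.0)
			break;

		level_pages = next_level;
		subtree_pages = next_subtree;
		step++;
	}
	return step;
}
```

For example, with r = 100, 16384 cache pages and 2048 pages of
maintenance_work_mem, the second limit is the binding one and the result is 1.
Raising maintenance_work_mem to 20000 pages gives 2.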

------
With best regards,
Alexander Korotkov.

#107Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#104)
Re: WIP: Fast GiST index build

Why is there ever a buffer on the root node? It seems like a waste of
time to load N tuples from the heap into the root buffer, only to empty
the buffer after it fills up. You might as well pull tuples directly
from the heap.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#108Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#107)
Re: WIP: Fast GiST index build

On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Why is there ever a buffer on the root node? It seems like a waste of time
to load N tuples from the heap into the root buffer, only to empty the
buffer after it fills up. You might as well pull tuples directly from the
heap.

Yes, seems reasonable. A buffer on the root node was in the paper. But now I
don't see the need for it either.

------
With best regards,
Alexander Korotkov.

#109Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#108)
Re: WIP: Fast GiST index build

On 16.08.2011 21:46, Alexander Korotkov wrote:

On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Why is there ever a buffer on the root node? It seems like a waste of time
to load N tuples from the heap into the root buffer, only to empty the
buffer after it fills up. You might as well pull tuples directly from the
heap.

Yes, seems reasonable. A buffer on the root node was in the paper. But now I
don't see the need for it either.

Here's a version of the patch with a bunch of minor changes:

* No more buffer on root node. Aside from the root buffer being
pointless, this simplifies gistRelocateBuildBuffersOnSplit slightly as
it doesn't need the special case for root block anymore.

* Moved the code to create new root item from gistplacetopage() to
gistRelocateBuildBuffersOnSplit(). Seems better to keep the
buffering-related code away from the normal codepath, for the sake of
readability.

* Changed the levelStep calculation to use the more accurate upper bound
on subtree size that we discussed.

* Changed the levelStep calculation so that it doesn't do just
min(maintenance_work_mem, effective_cache_size) and calculate the
levelStep from that. Maintenance_work_mem determines the max.
number of page buffers that can be held in memory at a time, while
effective_cache_size determines the max size of the subtree. Those are
subtly different things.

* Renamed NodeBuffer to GISTNodeBuffer, to avoid cluttering the namespace

* Plus misc comment, whitespace, formatting and naming changes.

I think this patch is in pretty good shape now. Could you re-run the
performance tests you have on the wiki page, please, to make sure the
performance hasn't regressed? It would also be nice to get some testing
on the levelStep and pagesPerBuffer estimates, and the point where we
switch to the buffering method. I'm particularly interested to know if
there's any corner-cases with very skewed data distributions or strange
GUC settings, where the estimates fail badly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#110Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#109)
1 attachment(s)
Re: WIP: Fast GiST index build

On 16.08.2011 22:10, Heikki Linnakangas wrote:

Here's a version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist_fast_build-heikki-0.14.1.1.patch (text/x-diff)
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 78171cf..244a2e2 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -642,6 +642,35 @@ my_distance(PG_FUNCTION_ARGS)
 
   </variablelist>
 
+ <sect2 id="gist-buffering-build">
+  <title>GiST buffering build</title>
+  <para>
+   Building large GiST indexes that don't fit in cache by simply inserting
+   all the tuples tends to be slow, because if the index tuples are scattered
+   across the index, a large fraction of the insertions need to perform
+   I/O. The exception is well-ordered datasets, where the part of the index
+   where new insertions go to stays well cached.
+   PostgreSQL from version 9.2 supports a more efficient method to build
+   GiST indexes based on buffering, which can dramatically reduce the number of
+   random I/Os needed.
+  </para>
+
+  <para>
+   However, buffering index build needs to call the <function>penalty</>
+   function more often, which consumes some extra CPU resources. Also, it can
+   influence the quality of the produced index, in both positive and negative
+   directions. That influence depends on various factors, like the
+   distribution of the input data and operator class implementation.
+  </para>
+
+  <para>
+   By default, the index build switches to the buffering method when the
+   index size reaches <xref linkend="guc-effective-cache-size">. It can
+   be manually turned on or off by the <literal>BUFFERING</literal> parameter
+   to the CREATE INDEX clause.
+  </para>
+
+ </sect2>
 </sect1>
 
 <sect1 id="gist-examples">
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 1a1e8d6..1032c94 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -341,6 +341,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
    </varlistentry>
 
    </variablelist>
+   <para>
+    GiST indexes additionally accept the following parameter:
+   </para>
+
+   <variablelist>
+
+   <varlistentry>
+    <term><literal>BUFFERING</></term>
+    <listitem>
+    <para>
+     Determines whether the buffering build technique described in
+     <xref linkend="gist-buffering-build"> is used to build the index. With
+     <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+     with <literal>AUTO</> (default) it is initially disabled, but turned on
+     on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">.
+    </para>
+    </listitem>
+   </varlistentry>
+
+   </variablelist>
   </refsect2>
 
   <refsect2 id="SQL-CREATEINDEX-CONCURRENTLY">
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 900b222..4514818 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -30,6 +30,9 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+
+static void validateBufferingOption(char *value);
+
 /*
  * Contents of pg_class.reloptions
  *
@@ -219,6 +222,17 @@ static relopt_real realRelOpts[] =
 
 static relopt_string stringRelOpts[] =
 {
+	{
+		{
+			"buffering",
+			"Enables buffering build for this GiST index",
+			RELOPT_KIND_GIST
+		},
+		4,
+		false,
+		validateBufferingOption,
+		"auto"
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1267,3 +1281,24 @@ tablespace_reloptions(Datum reloptions, bool validate)
 
 	return (bytea *) tsopts;
 }
+
+/*
+ * Validator for "buffering" option of GiST indexes. Allows only "on", "off" and
+ * "auto" values.
+ */
+static void
+validateBufferingOption(char *value)
+{
+	if (!value ||
+		(
+		 strcmp(value, "on") &&
+		 strcmp(value, "off") &&
+		 strcmp(value, "auto")
+		 )
+		)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("Only \"on\", \"off\" and \"auto\" values are available for \"buffering\" option.")));
+	}
+}
diff --git a/src/backend/access/gist/Makefile b/src/backend/access/gist/Makefile
index f8051a2..cc9468f 100644
--- a/src/backend/access/gist/Makefile
+++ b/src/backend/access/gist/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-       gistproc.o gistsplit.o
+       gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 2d78dcb..7cbd1e3 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -24,6 +24,7 @@ The current implementation of GiST supports:
   * provides NULL-safe interface to GiST core
   * Concurrency
   * Recovery support via WAL logging
+  * Buffering build algorithm
 
 The support for concurrency implemented in PostgreSQL was developed based on
 the paper "Access Methods for Next-Generation Database Systems" by
@@ -31,6 +32,12 @@ Marcel Kornaker:
 
     http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
 
+Buffering build algorithm for GiST was developed based on the paper "Efficient
+Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
+and Jeffrey Scott Vitter.
+
+    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+
 The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
 with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
 crashed in the middle to completion by adding the downlink in the parent.
 
+Buffering build algorithm
+-------------------------
+
+In the buffering index build algorithm, some or all internal nodes have a
+buffer attached to them. When a tuple is inserted at the top, the descent down
+the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
+the buffer. When a buffer gets too full, all the tuples in it are flushed to
+the lower level, where they again hit lower level buffers or leaf pages. This
+makes the insertions happen in more of a breadth-first than depth-first order,
+which greatly reduces the amount of random I/O required.
+
+In the algorithm, levels are numbered so that leaf pages have level zero,
+and internal node levels count up from 1. This numbering ensures that a page's
+level number never changes, even when the root page is split.
+
+Level                    Tree
+
+3                         *
+                      /       \
+2                *                 *
+              /  |  \           /  |  \
+1          *     *     *     *     *     *
+          / \   / \   / \   / \   / \   / \
+0        o   o o   o o   o o   o o   o o   o
+
+* - internal page
+o - leaf page
+
+Internal pages that belong to certain levels have buffers associated with
+them. Leaf pages never have buffers. Which levels have buffers is controlled
+by "level step" parameter: level numbers that are multiples of level_step
+have buffers, while others do not. For example, if level_step = 2, then
+pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
+internal page has a buffer.
+
+Level        Tree (level_step = 1)                Tree (level_step = 2)   
+                                        
+3                      *(b)                                  *
+                   /       \                             /       \
+2             *(b)              *(b)                *(b)              *(b)
+           /  |  \           /  |  \             /  |  \           /  |  \
+1       *(b)  *(b)  *(b)  *(b)  *(b)  *(b)    *     *     *     *     *     *
+       / \   / \   / \   / \   / \   / \     / \   / \   / \   / \   / \   / \
+0     o   o o   o o   o o   o o   o o   o   o   o o   o o   o o   o o   o o   o
+
+(b) - buffer
+
+Logically, a buffer is just a bunch of tuples. Physically, it is divided into
+pages, backed by a temporary file. Each buffer can be in one of two states:
+a) Last page of the buffer is kept in main memory. A node buffer is
+automatically switched to this state when a new index tuple is added to it,
+or a tuple is removed from it.
+b) All pages of the buffer are swapped out to disk. When a buffer becomes too
+full, and we start to flush it, all other buffers are switched to this state.
+
+When an index tuple is inserted, its initial processing can end in one of the
+following points:
+1) Leaf page, if the depth of the index <= level_step, meaning that
+   none of the internal pages have buffers associated with them.
+2) Buffer of topmost level page that has buffers.
+
+New index tuples are processed until one of the buffers in the topmost
+buffered level becomes half-full. When a buffer becomes half-full, it's added
+to the emptying queue, and will be emptied before a new tuple is processed.
+
+Buffer emptying process means that index tuples from the buffer are moved
+into buffers at a lower level, or leaf pages. First, all the other buffers are
+swapped to disk to free up the memory. Then tuples are popped from the buffer
+one by one, and cascaded down the tree to the next buffer or leaf page below
+the buffered node.
+
+Emptying a buffer has the interesting dynamic property that any intermediate
+pages between the buffer being emptied, and the next buffered or leaf level
+below it, become cached. If there are no more buffers below the node, the leaf
+pages where the tuples finally land on get cached too. If there are, the last
+buffer page of each buffer below is kept in memory. This is illustrated in
+the figures below:
+
+   Buffer being emptied to
+     lower-level buffers               Buffer being emptied to leaf pages
+
+               +(fb)                                 +(fb)
+            /     \                                /     \
+        +             +                        +             +
+      /   \         /   \                    /   \         /   \
+    *(ab)   *(ab) *(ab)   *(ab)            x       x     x       x
+
++    - cached internal page
+x    - cached leaf page
+*    - non-cached internal page
+(fb) - buffer being emptied
+(ab) - buffers being appended to, with last page in memory
+
+In the beginning of the index build, the level-step is chosen so that all those
+pages involved in emptying one buffer fit in cache, so after each of those
+pages have been accessed once and cached, emptying a buffer doesn't involve
+any more I/O. This locality is where the speedup of the buffering algorithm
+comes from.
+
+Emptying one buffer can fill up one or more of the lower-level buffers,
+triggering emptying of them as well. Whenever a buffer becomes too full, it's
+added to the emptying queue, and will be emptied after the current buffer has
+been processed.
+
+To keep the size of each buffer limited even in the worst case, buffer emptying
+is scheduled as soon as a buffer becomes half-full, and emptying it continues
+until 1/2 of the nominal buffer size worth of tuples has been emptied. This
+guarantees that when buffer emptying begins, all the lower-level buffers
+are at most half-full. In the worst case that all the tuples are cascaded down
+to the same lower-level buffer, that buffer therefore has enough space to
+accommodate all the tuples emptied from the upper-level buffer. There is no
+hard size limit in any of the data structures used, though, so this only needs
+to be approximate; small overfilling of some buffers doesn't matter.
+
+If an internal page that has a buffer associated with it is split, the buffer
+needs to be split too. All tuples in the buffer are scanned through and
+relocated to the correct sibling buffers, using the penalty function to decide
+which buffer each tuple should go to.
+
+After all tuples from the heap have been processed, there are still some index
+tuples in the buffers. At this point, final buffer emptying starts. All buffers
+are emptied in top-down order. This is slightly complicated by the fact that
+new buffers can be allocated during the emptying, due to page splits. However,
+the new buffers will always be siblings of buffers that haven't been fully
+emptied yet; tuples never move upwards in the tree. The final emptying loops
+through buffers at a given level until all buffers at that level have been
+emptied, and then moves down to the next level.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4fc7a21..b140e0c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -24,15 +24,6 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-/* Working state for gistbuild and its callback */
-typedef struct
-{
-	GISTSTATE	giststate;
-	int			numindexattrs;
-	double		indtuples;
-	MemoryContext tmpCtx;
-} GISTBuildState;
-
 /* A List of these is used represent a split-in-progress. */
 typedef struct
 {
@@ -41,16 +32,6 @@ typedef struct
 } GISTPageSplitInfo;
 
 /* non-export function prototypes */
-static void gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state);
-static void gistdoinsert(Relation r,
-			 IndexTuple itup,
-			 Size freespace,
-			 GISTSTATE *GISTstate);
 static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 GISTSTATE *giststate,
@@ -89,138 +70,6 @@ createTempGistContext(void)
 }
 
 /*
- * Routine to build an index.  Basically calls insert over and over.
- *
- * XXX: it would be nice to implement some sort of bulk-loading
- * algorithm, but it is not clear how to do that.
- */
-Datum
-gistbuild(PG_FUNCTION_ARGS)
-{
-	Relation	heap = (Relation) PG_GETARG_POINTER(0);
-	Relation	index = (Relation) PG_GETARG_POINTER(1);
-	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
-	IndexBuildResult *result;
-	double		reltuples;
-	GISTBuildState buildstate;
-	Buffer		buffer;
-	Page		page;
-
-	/*
-	 * We expect to be called exactly once for any index relation. If that's
-	 * not the case, big trouble's what we have.
-	 */
-	if (RelationGetNumberOfBlocks(index) != 0)
-		elog(ERROR, "index \"%s\" already contains data",
-			 RelationGetRelationName(index));
-
-	/* no locking is needed */
-	initGISTstate(&buildstate.giststate, index);
-
-	/* initialize the root page */
-	buffer = gistNewBuffer(index);
-	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
-	page = BufferGetPage(buffer);
-
-	START_CRIT_SECTION();
-
-	GISTInitBuffer(buffer, F_LEAF);
-
-	MarkBufferDirty(buffer);
-
-	if (RelationNeedsWAL(index))
-	{
-		XLogRecPtr	recptr;
-		XLogRecData rdata;
-
-		rdata.data = (char *) &(index->rd_node);
-		rdata.len = sizeof(RelFileNode);
-		rdata.buffer = InvalidBuffer;
-		rdata.next = NULL;
-
-		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
-		PageSetLSN(page, recptr);
-		PageSetTLI(page, ThisTimeLineID);
-	}
-	else
-		PageSetLSN(page, GetXLogRecPtrForTemp());
-
-	UnlockReleaseBuffer(buffer);
-
-	END_CRIT_SECTION();
-
-	/* build the index */
-	buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
-	buildstate.indtuples = 0;
-
-	/*
-	 * create a temporary memory context that is reset once for each tuple
-	 * inserted into the index
-	 */
-	buildstate.tmpCtx = createTempGistContext();
-
-	/* do the heap scan */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
-								   gistbuildCallback, (void *) &buildstate);
-
-	/* okay, all heap tuples are indexed */
-	MemoryContextDelete(buildstate.tmpCtx);
-
-	freeGISTstate(&buildstate.giststate);
-
-	/*
-	 * Return statistics
-	 */
-	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
-
-	result->heap_tuples = reltuples;
-	result->index_tuples = buildstate.indtuples;
-
-	PG_RETURN_POINTER(result);
-}
-
-/*
- * Per-tuple callback from IndexBuildHeapScan
- */
-static void
-gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state)
-{
-	GISTBuildState *buildstate = (GISTBuildState *) state;
-	IndexTuple	itup;
-	MemoryContext oldCtx;
-
-	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
-
-	/* form an index tuple and point it at the heap tuple */
-	itup = gistFormTuple(&buildstate->giststate, index,
-						 values, isnull, true /* size is currently bogus */ );
-	itup->t_tid = htup->t_self;
-
-	/*
-	 * Since we already have the index relation locked, we call gistdoinsert
-	 * directly.  Normal access method calls dispatch through gistinsert,
-	 * which locks the relation for write.	This is the right thing to do if
-	 * you're inserting single tups, but not when you're initializing the
-	 * whole index at once.
-	 *
-	 * In this path we respect the fillfactor setting, whereas insertions
-	 * after initial build do not.
-	 */
-	gistdoinsert(index, itup,
-			  RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
-				 &buildstate->giststate);
-
-	buildstate->indtuples += 1;
-	MemoryContextSwitchTo(oldCtx);
-	MemoryContextReset(buildstate->tmpCtx);
-}
-
-/*
  *	gistbuildempty() -- build an empty gist index in the initialization fork
  */
 Datum
@@ -275,7 +124,6 @@ gistinsert(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(false);
 }
 
-
 /*
  * Place tuples from 'itup' to 'buffer'. If 'oldoffnum' is valid, the tuple
  * at that offset is atomically removed along with inserting the new tuples.
@@ -293,19 +141,27 @@ gistinsert(PG_FUNCTION_ARGS)
  * In that case, we continue to hold the root page locked, and the child
  * pages are released; note that new tuple(s) are *not* on the root page
  * but in one of the new child pages.
+ *
+ * This function also has special behaviour during a buffering build. It
+ * maintains the data structures of the buffering build: it creates a new
+ * root path item if needed and relocates the buffer of a split node. It
+ * also doesn't return splitinfo to the caller, but inserts the downlinks
+ * in a simplified way, through a recursive call.
  */
-static bool
+bool
 gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 				Buffer buffer,
 				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
 				Buffer leftchildbuf,
-				List **splitinfo)
+				List **splitinfo,
+				GISTBufferingInsertStack *path)
 {
 	Page		page = BufferGetPage(buffer);
 	bool		is_leaf = (GistPageIsLeaf(page)) ? true : false;
 	XLogRecPtr	recptr;
 	int			i;
 	bool		is_split;
+	GISTBuildBuffers *gfbb = giststate->gfbb;
 
 	/*
 	 * Refuse to modify a page that's incompletely split. This should not
@@ -319,7 +175,14 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 	if (GistFollowRight(page))
 		elog(ERROR, "concurrent GiST page split was incomplete");
 
-	*splitinfo = NIL;
+	if (!gfbb)
+	{
+		/*
+		 * A buffering build doesn't return splitinfo to the caller. In any
+		 * other case, initialize splitinfo as an empty list.
+		 */
+		*splitinfo = NIL;
+	}
 
 	/*
 	 * if isupdate, remove old key: This node's key has been modified, either
@@ -408,6 +271,24 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			GistTupleSetValid(ptr->itup);
 		}
 
+		/* Are we inside a buffering build? */
+		if (gfbb)
+		{
+			/*
+			 * The parent may have changed since we recorded it, so re-find
+			 * the correct parent.
+			 */
+			if (!is_rootsplit)
+				gistBufferingFindCorrectParent(giststate, state->r, path);
+
+			/*
+			 * Redistribute the index tuples in the buffer of the split page
+			 * among the buffers of the pages produced by the split.
+			 */
+			gistRelocateBuildBuffersOnSplit(giststate->gfbb, giststate, state->r,
+											path, buffer, dist);
+		}
+
 		/*
 		 * If this is a root split, we construct the new root page with the
 		 * downlinks here directly, instead of requiring the caller to insert
@@ -439,9 +320,12 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			rootpg.next = dist;
 			dist = &rootpg;
 		}
-		else
+		else if (!gfbb)
 		{
-			/* Prepare split-info to be returned to caller */
+			/*
+			 * If we're not in buffering build then prepare split-info to be
+			 * returned to caller.
+			 */
 			for (ptr = dist; ptr; ptr = ptr->next)
 			{
 				GISTPageSplitInfo *si = palloc(sizeof(GISTPageSplitInfo));
@@ -474,7 +358,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			else
 				GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
 
-			if (ptr->next && !is_rootsplit)
+			if (ptr->next && !is_rootsplit && !gfbb)
 				GistMarkFollowRight(ptr->page);
 			else
 				GistClearFollowRight(ptr->page);
@@ -508,7 +392,8 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 		/* Write the WAL record */
 		if (RelationNeedsWAL(state->r))
 			recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
-								   dist, oldrlink, oldnsn, leftchildbuf);
+								   dist, oldrlink, oldnsn, leftchildbuf,
+								   gfbb ? true : false);
 		else
 			recptr = GetXLogRecPtrForTemp();
 
@@ -524,12 +409,51 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 		 * If this was a root split, we've already inserted the downlink
 		 * pointers, in the form of a new root page. Therefore we can release
 		 * all the new buffers, and keep just the root page locked.
+		 *
+		 * In a buffering build there is no concurrent activity, so we can
+		 * use simplified downlink insertion. In that case, too, we can
+		 * release all the new buffers.
 		 */
-		if (is_rootsplit)
+		if (is_rootsplit || gfbb)
 		{
 			for (ptr = dist->next; ptr; ptr = ptr->next)
 				UnlockReleaseBuffer(ptr->buffer);
 		}
+
+		if (gfbb && !is_rootsplit)
+		{
+			/*
+			 * Simplified insertion of downlinks during buffering build.
+			 */
+			IndexTuple *itups;
+			int			cnt = 0,
+						i;
+			Buffer		parentBuffer;
+
+			/* Count number of downlinks for insert. */
+			for (ptr = dist; ptr; ptr = ptr->next)
+			{
+				cnt++;
+			}
+
+			/* Allocate array of downlinks index tuples */
+			itups = (IndexTuple *) palloc(sizeof(IndexTuple) * cnt);
+
+			/* Fill that array */
+			i = 0;
+			for (ptr = dist; ptr; ptr = ptr->next)
+			{
+				itups[i] = ptr->itup;
+				i++;
+			}
+
+			/* Insert downlinks into parent. */
+			parentBuffer = ReadBuffer(state->r, path->parent->blkno);
+			LockBuffer(parentBuffer, GIST_EXCLUSIVE);
+			gistplacetopage(state, giststate, parentBuffer,
+							itups, cnt, path->downlinkoffnum, InvalidBuffer, NULL, path->parent);
+			UnlockReleaseBuffer(parentBuffer);
+		}
 	}
 	else
 	{
@@ -570,8 +494,6 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			recptr = GetXLogRecPtrForTemp();
 			PageSetLSN(page, recptr);
 		}
-
-		*splitinfo = NIL;
 	}
 
 	/*
@@ -608,7 +530,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  * this routine assumes it is invoked in a short-lived memory context,
  * so it does not bother releasing palloc'd allocations.
  */
-static void
+void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 {
 	ItemId		iid;
@@ -917,8 +839,8 @@ gistFindPath(Relation r, BlockNumber child, OffsetNumber *downlinkoffnum)
 		{
 			/*
 			 * Page was split while we looked elsewhere. We didn't see the
-			 * downlink to the right page when we scanned the parent, so
-			 * add it to the queue now.
+			 * downlink to the right page when we scanned the parent, so add
+			 * it to the queue now.
 			 *
 			 * Put the right page ahead of the queue, so that we visit it
 			 * next. That's important, because if this is the lowest internal
@@ -1195,7 +1117,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	is_split = gistplacetopage(state, giststate, stack->buffer,
 							   tuples, ntup, oldoffnum,
 							   leftchild,
-							   &splitinfo);
+							   &splitinfo, NULL);
 	if (splitinfo)
 		gistfinishsplit(state, stack, giststate, splitinfo);
 
@@ -1414,6 +1336,7 @@ initGISTstate(GISTSTATE *giststate, Relation index)
 		else
 			giststate->supportCollation[i] = DEFAULT_COLLATION_OID;
 	}
+	giststate->gfbb = NULL;
 }
 
 void
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
new file mode 100644
index 0000000..e0a0763
--- /dev/null
+++ b/src/backend/access/gist/gistbuild.c
@@ -0,0 +1,931 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuild.c
+ *	  build algorithm for GiST indexes implementation.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuild.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Interval, in index tuples, between checks for switching to buffering mode */
+#define BUFFERING_MODE_SWITCH_CHECK_STEP 256
+#define BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET 4096
+
+typedef enum
+{
+	GIST_BUFFERING_DISABLED,	/* in regular build mode and aren't going to
+								 * switch */
+	GIST_BUFFERING_AUTO,		/* in regular build mode, but will switch to
+								 * buffering build mode if the index grows
+								 * too big */
+	GIST_BUFFERING_STATS,		/* gathering statistics of index tuple size
+								 * before switching to the buffering build
+								 * mode */
+	GIST_BUFFERING_ACTIVE		/* in buffering build mode */
+} GistBufferingMode;
+
+/* Working state for gistbuild and its callback */
+typedef struct
+{
+	GISTSTATE	giststate;
+	int64		indtuples;
+	int64		indtuplesSize;
+
+	Size		freespace;	/* Amount of free space to leave on pages */
+
+	GistBufferingMode bufferingMode;
+	MemoryContext tmpCtx;
+} GISTBuildState;
+
+static void gistFreeUnreferencedPath(GISTBufferingInsertStack *path);
+static bool gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+				GISTBuildBuffers *gfbb, IndexTuple itup,
+				GISTBufferingInsertStack *startparent);
+static void gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state);
+static void gistBufferingBuildInsert(Relation index, IndexTuple itup,
+						 GISTBuildState *buildstate);
+static void gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state);
+static int	gistGetMaxLevel(Relation index);
+static bool gistInitBuffering(GISTBuildState *buildstate, Relation index);
+static int calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+						int levelStep);
+
+/*
+ * Free unreferenced parts of path.
+ */
+static void
+gistFreeUnreferencedPath(GISTBufferingInsertStack *path)
+{
+	while (path->refCount == 0)
+	{
+		/*
+		 * Path part is unreferenced. We can free it and decrease reference
+		 * count of parent. If parent becomes unreferenced too, the
+		 * procedure should be repeated for it.
+		 */
+		GISTBufferingInsertStack *tmp = path->parent;
+
+		pfree(path);
+		path = tmp;
+		if (path)
+			path->refCount--;
+		else
+			break;
+	}
+}
+
+/*
+ * Decrease reference count of path part and remove unreferenced path parts if
+ * any.
+ */
+void
+gistDecreasePathRefcount(GISTBufferingInsertStack *path)
+{
+	path->refCount--;
+	gistFreeUnreferencedPath(path);
+}
+
+/*
+ * Process an index tuple. Run the index tuple down until it meets a leaf
+ * page or a node buffer. If it meets a node buffer, it is just placed there.
+ * If it meets a leaf page, the actual insert takes place. Returns true if we
+ * have to stop the buffer emptying process (one of the child buffers can't
+ * take index tuples anymore).
+ */
+static bool
+gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+				GISTBuildBuffers *gfbb, IndexTuple itup,
+				GISTBufferingInsertStack *startparent)
+{
+	GISTBufferingInsertStack *path;
+	BlockNumber childblkno;
+	Buffer		buffer;
+	bool		result = false;
+
+	/*
+	 * NULL passed in startparent means that we start index tuple processing
+	 * from the root.
+	 */
+	if (!startparent)
+		path = gfbb->rootitem;
+	else
+		path = startparent;
+
+	/*
+	 * Loop until we are on a leaf page (level == 0) or we reach a level with
+	 * buffers. The level we started at doesn't count, because we must move
+	 * down by at least one level.
+	 */
+	for (;;)
+	{
+		ItemId		iid;
+		IndexTuple	idxtuple,
+					newtup;
+		Page		page;
+		OffsetNumber childoffnum;
+		GISTBufferingInsertStack *parent;
+
+		/*
+		 * Have we reached a level with buffers? The buffer of the page we
+		 * started from doesn't count.
+		 */
+		if (path != startparent && LEVEL_HAS_BUFFERS(path->level, gfbb))
+			break;
+
+		/* Have we reached a leaf page? */
+		if (path->level == 0)
+			break;
+
+		/* Choose child for insertion */
+		buffer = ReadBuffer(state->r, path->blkno);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+
+		page = (Page) BufferGetPage(buffer);
+		childoffnum = gistchoose(state->r, page, itup, giststate);
+		iid = PageGetItemId(page, childoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+		/* Adjust key representing child if needed */
+		newtup = gistgetadjusted(state->r, idxtuple, itup, giststate);
+
+		if (newtup)
+		{
+			/*
+			 * Key adjustment actually produced a new key, so we need to
+			 * update it on the page.
+			 */
+			gistplacetopage(state, giststate, buffer, &newtup, 1, childoffnum,
+							InvalidBuffer, NULL, path);
+		}
+		UnlockReleaseBuffer(buffer);
+
+		/* Create new path item representing current page */
+		parent = path;
+		path = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+										   sizeof(GISTBufferingInsertStack));
+		path->parent = parent;
+		path->level = parent->level - 1;
+		path->blkno = childblkno;
+		path->downlinkoffnum = childoffnum;
+
+		/* It's unreferenced just now */
+		path->refCount = 0;
+
+		/* Adjust reference count of parent */
+		if (parent)
+			parent->refCount++;
+	}
+
+	if (LEVEL_HAS_BUFFERS(path->level, gfbb))
+	{
+		/*
+		 * We've reached level with buffers. Now place index tuple to the
+		 * buffer and add buffer emptying stack element if buffer overflows.
+		 */
+		GISTNodeBuffer *childNodeBuffer;
+
+		/* Find node buffer or create a new one */
+		childNodeBuffer = gistGetNodeBuffer(gfbb, giststate, path->blkno,
+										  path->downlinkoffnum, path->parent,
+											true);
+
+		/* Add index tuple to it */
+		gistPushItupToNodeBuffer(gfbb, childNodeBuffer, itup);
+
+		if (BUFFER_HALF_FILLED(childNodeBuffer, gfbb) && !childNodeBuffer->queuedForEmptying)
+		{
+			/*
+			 * The node buffer has just become half-filled. Let's add it to
+			 * the emptying queue.
+			 */
+			MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+			childNodeBuffer->queuedForEmptying = true;
+			gfbb->bufferEmptyingQueue = lcons(childNodeBuffer,
+											  gfbb->bufferEmptyingQueue);
+			MemoryContextSwitchTo(oldcxt);
+		}
+
+		if (BUFFER_OVERFLOWED(childNodeBuffer, gfbb))
+			result = true;
+	}
+	else
+	{
+		/*
+		 * We've reached leaf level. So, place index tuple here.
+		 */
+		buffer = ReadBuffer(state->r, path->blkno);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistplacetopage(state, giststate, buffer, &itup, 1,
+						InvalidOffsetNumber, InvalidBuffer, NULL, path);
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/*
+	 * Free unreferenced path items if any. Path item may be referenced by
+	 * node buffer.
+	 */
+	gistFreeUnreferencedPath(path);
+
+	return result;
+}
+
+
+/*
+ * Find correct parent by following rightlinks in buffering index build. This
+ * method of parent searching is possible because there is no concurrent
+ * activity while the index is being built.
+ */
+void
+gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r,
+							   GISTBufferingInsertStack *child)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	GISTBufferingInsertStack *parent = child->parent;
+	OffsetNumber i,
+				maxoff;
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	Buffer		buffer;
+	Page		page;
+	bool		copied = false;
+
+	buffer = ReadBuffer(r, parent->blkno);
+	page = BufferGetPage(buffer);
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	gistcheckpage(r, buffer);
+
+	/* Check if it was not moved */
+	if (child->downlinkoffnum != InvalidOffsetNumber)
+	{
+		iid = PageGetItemId(page, child->downlinkoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+		{
+			/* Still there */
+			UnlockReleaseBuffer(buffer);
+			return;
+		}
+	}
+
+	/* The parent has changed; follow rightlinks until the child is found */
+	while (true)
+	{
+		/* Search for relevant downlink in the current page */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+			{
+				/* Found it */
+				child->downlinkoffnum = i;
+				UnlockReleaseBuffer(buffer);
+				return;
+			}
+		}
+
+		/*
+		 * We should copy parent path item because some other path items can
+		 * refer to it.
+		 */
+		if (!copied)
+		{
+			parent = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+										   sizeof(GISTBufferingInsertStack));
+			memcpy(parent, child->parent, sizeof(GISTBufferingInsertStack));
+			if (parent->parent)
+				parent->parent->refCount++;
+			gistDecreasePathRefcount(child->parent);
+			child->parent = parent;
+			parent->refCount = 1;
+			copied = true;
+		}
+
+		/*
+		 * Not found in current page. Move towards rightlink.
+		 */
+		parent->blkno = GistPageGetOpaque(page)->rightlink;
+		UnlockReleaseBuffer(buffer);
+
+		if (parent->blkno == InvalidBlockNumber)
+		{
+			/*
+			 * End of chain and still didn't find parent. Should not happen
+			 * during index build.
+			 */
+			break;
+		}
+
+		/* Get the next page */
+		buffer = ReadBuffer(r, parent->blkno);
+		page = BufferGetPage(buffer);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(r, buffer);
+	}
+
+	elog(ERROR, "failed to re-find parent for block %u", child->blkno);
+}
+
+/*
+ * Process the buffer emptying stack. Emptying one buffer can cause emptying
+ * of other buffers. This function iterates until this cascading emptying
+ * process has finished, i.e. until the buffer emptying stack is empty.
+ */
+static void
+gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+
+	/* Iterate while we have elements in buffers emptying stack. */
+	while (gfbb->bufferEmptyingQueue != NIL)
+	{
+		GISTNodeBuffer *emptyingNodeBuffer;
+
+		/* Get node buffer from emptying stack. */
+		emptyingNodeBuffer = (GISTNodeBuffer *) linitial(gfbb->bufferEmptyingQueue);
+		gfbb->bufferEmptyingQueue = list_delete_first(gfbb->bufferEmptyingQueue);
+		emptyingNodeBuffer->queuedForEmptying = false;
+
+		/*
+		 * We are going to load the last pages of the buffers that we empty
+		 * into. So let's unload any previously loaded buffers.
+		 */
+		gistUnloadNodeBuffers(gfbb);
+
+		/* Variables for detecting a split of the buffer being emptied. */
+		gfbb->currentEmptyingBufferSplit = false;
+		gfbb->currentEmptyingBufferBlockNumber = emptyingNodeBuffer->nodeBlocknum;
+
+		while (true)
+		{
+			IndexTuple	itup;
+
+			/* Get the next index tuple from the node buffer */
+			if (!gistPopItupFromNodeBuffer(gfbb, emptyingNodeBuffer, &itup))
+				break;
+
+			/* Run it down to the underlying node buffer or leaf page */
+			if (gistProcessItup(giststate, state, gfbb, itup, emptyingNodeBuffer->path))
+				break;
+
+			/* Free all the memory allocated during index tuple processing */
+			MemoryContextReset(CurrentMemoryContext);
+
+			/*
+			 * If the node buffer being emptied was split, we must stop
+			 * emptying, because that node buffer no longer exists.
+			 */
+			if (gfbb->currentEmptyingBufferSplit)
+				break;
+		}
+	}
+}
+
+/*
+ * Insert function for buffering index build.
+ */
+static void
+gistBufferingBuildInsert(Relation index, IndexTuple itup,
+						 GISTBuildState *buildstate)
+{
+	GISTBuildBuffers *gfbb = buildstate->giststate.gfbb;
+	GISTInsertState insertstate;
+
+	memset(&insertstate, 0, sizeof(GISTInsertState));
+	insertstate.freespace = buildstate->freespace;
+	insertstate.r = index;
+
+	/* We are ready for index tuple processing */
+	gistProcessItup(&buildstate->giststate, &insertstate, gfbb, itup, NULL);
+
+	/* Process buffer emptying stack if any */
+	gistProcessEmptyingStack(&buildstate->giststate, &insertstate);
+}
+
+/*
+ * Per-tuple callback from IndexBuildHeapScan.
+ */
+static void
+gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(&buildstate->giststate, index, values, isnull, true);
+	itup->t_tid = htup->t_self;
+
+	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE)
+	{
+		/* We have buffers, so use them. */
+		gistBufferingBuildInsert(index, itup, buildstate);
+	}
+	else
+	{
+		/*
+		 * There are no buffers (yet). Since we already have the index relation
+		 * locked, we call gistdoinsert directly.
+		 *
+		 * In this path we respect the fillfactor setting, whereas insertions
+		 * after initial build do not.
+		 */
+		gistdoinsert(index, itup, buildstate->freespace,
+					 &buildstate->giststate);
+	}
+
+	/* Update statistics: index tuple count and their total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->tmpCtx);
+
+	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE &&
+		buildstate->indtuples % BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET == 0)
+	{
+		/* We have to adjust the buffer size now */
+		buildstate->giststate.gfbb->pagesPerBuffer =
+			calculatePagesPerBuffer(buildstate, index,
+									buildstate->giststate.gfbb->levelStep);
+	}
+
+	/*
+	 * For automatic switching to buffering mode, check whether the index
+	 * still fits in the effective cache. We call smgrnblocks only once every
+	 * BUFFERING_MODE_SWITCH_CHECK_STEP index tuples, because frequent
+	 * smgrnblocks calls can be expensive.
+	 */
+	if ((buildstate->bufferingMode == GIST_BUFFERING_AUTO &&
+		 buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
+		 effective_cache_size < smgrnblocks(index->rd_smgr, MAIN_FORKNUM)) ||
+		(buildstate->bufferingMode == GIST_BUFFERING_STATS &&
+		 buildstate->indtuples >= BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET))
+	{
+		/*
+		 * Index doesn't fit in the effective cache anymore (or enough tuple
+		 * size statistics are gathered). Try to switch to buffering build.
+		 */
+		if (gistInitBuffering(buildstate, index))
+		{
+			/*
+			 * Buffering build is successfully initialized. Now we can set
+			 * appropriate flag.
+			 */
+			buildstate->bufferingMode = GIST_BUFFERING_ACTIVE;
+		}
+		else
+		{
+			/*
+			 * Failed to switch to buffering build because the memory settings
+			 * are too low. Mark that we aren't going to switch anymore.
+			 */
+			buildstate->bufferingMode = GIST_BUFFERING_DISABLED;
+		}
+	}
+}
+
+/*
+ * Calculate pagesPerBuffer parameter for the buffering algorithm.
+ *
+ * Buffer size is chosen so that assuming that tuples are distributed
+ * randomly, emptying half a buffer fills on average one page in every buffer
+ * at the next lower level.
+ */
+static int
+calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+						int levelStep)
+{
+	double		pagesPerBuffer;
+	double		avgIndexTuplesPerPage;
+	double		itupAvgSize;
+	Size		pageFreeSpace;
+
+	/* Calc space of index page which is available for index tuples */
+	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+		- sizeof(ItemIdData)
+		- buildstate->freespace;
+
+	/*
+	 * Calculate average size of already inserted index tuples using
+	 * gathered statistics.
+	 */
+	itupAvgSize = (double) buildstate->indtuplesSize /
+				  (double) buildstate->indtuples;
+
+	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+
+	/*
+	 * Recalculate required size of buffers.
+	 */
+	pagesPerBuffer = 2 * pow(avgIndexTuplesPerPage, levelStep);
+
+	return round(pagesPerBuffer);
+}
+
+
+/*
+ * Get maximum level number of GiST index. Scans the tree from the root until
+ * it meets a leaf page, choosing the first link on each page.
+ */
+static int
+gistGetMaxLevel(Relation index)
+{
+	int			maxLevel = 0;
+	BlockNumber blkno = GIST_ROOT_BLKNO;
+
+	while (true)
+	{
+		Buffer		buffer;
+		Page		page;
+		IndexTuple	itup;
+
+		/* Read page */
+		buffer = ReadBuffer(index, blkno);
+		page = (Page) BufferGetPage(buffer);
+
+		/* Is it a leaf page? */
+		if (GistPageIsLeaf(page))
+		{
+			/* Page is leaf. We've counted height of tree. */
+			ReleaseBuffer(buffer);
+			break;
+		}
+
+		/*
+		 * Page is not leaf. Descend to the underlying page using its first
+		 * link.
+		 */
+		itup = (IndexTuple) PageGetItem(page,
+									 PageGetItemId(page, FirstOffsetNumber));
+		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+		ReleaseBuffer(buffer);
+
+		/*
+		 * We're going down the tree. It means that there is one more level
+		 * in the tree.
+		 */
+		maxLevel++;
+	}
+	return maxLevel;
+}
+
+/*
+ * Initial calculations for GiST buffering build.
+ */
+static bool
+gistInitBuffering(GISTBuildState *buildstate, Relation index)
+{
+	int			pagesPerBuffer;
+	Size		pageFreeSpace;
+	Size		itupAvgSize,
+				itupMinSize;
+	double		avgIndexTuplesPerPage,
+				maxIndexTuplesPerPage;
+	int			i;
+	int			levelStep;
+	GISTBuildBuffers *gfbb;
+
+	/* Calc space of index page which is available for index tuples */
+	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+		- sizeof(ItemIdData)
+		- buildstate->freespace;
+
+	/*
+	 * Calculate average size of already inserted index tuples using gathered
+	 * statistics.
+	 */
+	itupAvgSize = (double) buildstate->indtuplesSize /
+				  (double) buildstate->indtuples;
+
+	/*
+	 * Calculate minimal possible size of index tuple by index metadata.
+	 * Minimal possible size of varlena is VARHDRSZ.
+	 *
+	 * XXX: that's not actually true, as a short varlen can be just 2 bytes.
+	 * And we should take padding into account here.
+	 */
+	itupMinSize = (Size) MAXALIGN(sizeof(IndexTupleData));
+	for (i = 0; i < index->rd_att->natts; i++)
+	{
+		if (index->rd_att->attrs[i]->attlen < 0)
+			itupMinSize += VARHDRSZ;
+		else
+			itupMinSize += index->rd_att->attrs[i]->attlen;
+	}
+
+	/* Calculate average and maximal number of index tuples which fit to page */
+	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+	maxIndexTuplesPerPage = pageFreeSpace / itupMinSize;
+
+	/*
+	 * We need to calculate two parameters for the buffering algorithm:
+	 * levelStep and pagesPerBuffer.
+	 *
+	 * levelStep determines the size of subtree that we operate on, while
+	 * emptying a buffer. A higher value is better, as you need fewer buffer
+	 * emptying steps to perform the index build. However, if you set it too
+	 * high, the subtree doesn't fit in cache anymore, and you quickly lose
+	 * the benefit of the buffers.
+	 *
+	 * In Arge et al's paper, levelStep is chosen as logB(M/4B), where B is
+	 * the number of tuples on page (ie. fanout), and M is the amount of
+	 * internal memory available. Curiously, they don't explain *why* that
+	 * setting is optimal. We calculate it by taking the highest levelStep
+	 * so that a subtree still fits in cache. For a small B, our way of
+	 * calculating levelStep is very close to Arge et al's formula. For a
+	 * large B, our formula gives a value that is 2x higher.
+	 *
+	 * The average size of a subtree of depth n can be calculated as a
+	 * geometric series:
+	 *
+	 *		B^0 + B^1 + B^2 + ... + B^n = (1 - B^(n + 1)) / (1 - B)
+	 *
+	 * where B is the average number of index tuples on page. The subtree is
+	 * cached in the shared buffer cache and the OS cache, so we choose
+	 * levelStep so that the subtree size is comfortably smaller than
+	 * effective_cache_size, with a safety factor of 4.
+	 *
+	 * The estimate on the average number of index tuples on page is based on
+	 * average tuple sizes observed before switching to buffered build, so the
+	 * real subtree size can be somewhat larger. Also, it would be selfish to
+	 * gobble the whole cache for our index build. The safety factor of 4
+	 * should account for those effects.
+	 *
+	 * The other limiting factor for setting levelStep is that while
+	 * processing a subtree, we need to hold one page for each buffer at the
+	 * next lower buffered level. The max. number of buffers needed for that
+	 * is maxIndexTuplesPerPage^levelStep. This is very conservative, but
+	 * hopefully maintenance_work_mem is set high enough that you're
+	 * constrained by effective_cache_size rather than maintenance_work_mem.
+	 *
+	 * XXX: the buffer hash table consumes a fair amount of memory too per
+	 * buffer, but that is not currently taken into account. That scales on
+	 * the total number of buffers used, ie. the index size and on levelStep.
+	 * Note that a higher levelStep *reduces* the amount of memory needed for
+	 * the hash table.
+	 */
+	levelStep = 0;
+	while (
+		/* subtree must fit in cache (with safety factor of 4) */
+		(1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size / 4
+		&&
+		/* each node in the lowest level of a subtree has one page in memory */
+		(pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ)
+		)
+	{
+		levelStep++;
+	}
+
+	/*
+	 * If there's not enough cache or maintenance_work_mem, fall back to plain
+	 * inserts.
+	 */
+	if (levelStep <= 0)
+	{
+		elog(DEBUG1, "failed to switch to buffered GiST build");
+		return false;
+	}
+
+	/*
+	 * The second parameter to set is pagesPerBuffer, which determines the
+	 * size of each buffer. We adjust pagesPerBuffer also during the build,
+	 * which is why this calculation is in a separate function.
+	 */
+	pagesPerBuffer = calculatePagesPerBuffer(buildstate, index, levelStep);
+
+	elog(DEBUG1, "switching to buffered GiST build; level step = %d, pagesPerBuffer = %d",
+		 levelStep, pagesPerBuffer);
+
+	/* Initialize GISTBuildBuffers with these parameters */
+	gfbb = palloc(sizeof(GISTBuildBuffers));
+	gfbb->pagesPerBuffer = pagesPerBuffer;
+	gfbb->levelStep = levelStep;
+	gistInitBuildBuffers(gfbb, gistGetMaxLevel(index));
+
+	buildstate->giststate.gfbb = gfbb;
+
+	return true;
+}
+
+/*
+ * Routine to build an index.  Basically calls insert over and over.
+ *
+ * XXX: it would be nice to implement some sort of bulk-loading
+ * algorithm, but it is not clear how to do that.
+ */
+Datum
+gistbuild(PG_FUNCTION_ARGS)
+{
+	Relation	heap = (Relation) PG_GETARG_POINTER(0);
+	Relation	index = (Relation) PG_GETARG_POINTER(1);
+	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+	IndexBuildResult *result;
+	double		reltuples;
+	GISTBuildState buildstate;
+	Buffer		buffer;
+	Page		page;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	buildstate.freespace = RelationGetTargetPageFreeSpace(index,
+													  GIST_DEFAULT_FILLFACTOR);
+
+	if (index->rd_options)
+	{
+		/* Get buffering mode from the options string */
+		GiSTOptions *options = (GiSTOptions *) index->rd_options;
+		char	   *bufferingMode = (char *) options + options->bufferingModeOffset;
+
+		if (strcmp(bufferingMode, "on") == 0)
+			buildstate.bufferingMode = GIST_BUFFERING_STATS;
+		else if (strcmp(bufferingMode, "off") == 0)
+			buildstate.bufferingMode = GIST_BUFFERING_DISABLED;
+		else
+			buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+	}
+	else
+	{
+		/* Automatic buffering mode switching by default */
+		buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+	}
+
+	/*
+	 * We expect to be called exactly once for any index relation. If that's
+	 * not the case, big trouble's what we have.
+	 */
+	if (RelationGetNumberOfBlocks(index) != 0)
+		elog(ERROR, "index \"%s\" already contains data",
+			 RelationGetRelationName(index));
+
+	/* no locking is needed */
+	initGISTstate(&buildstate.giststate, index);
+
+	/* initialize the root page */
+	buffer = gistNewBuffer(index);
+	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
+	page = BufferGetPage(buffer);
+
+	START_CRIT_SECTION();
+
+	GISTInitBuffer(buffer, F_LEAF);
+
+	MarkBufferDirty(buffer);
+
+	if (RelationNeedsWAL(index))
+	{
+		XLogRecPtr	recptr;
+		XLogRecData rdata;
+
+		rdata.data = (char *) &(index->rd_node);
+		rdata.len = sizeof(RelFileNode);
+		rdata.buffer = InvalidBuffer;
+		rdata.next = NULL;
+
+		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
+		PageSetLSN(page, recptr);
+		PageSetTLI(page, ThisTimeLineID);
+	}
+	else
+		PageSetLSN(page, GetXLogRecPtrForTemp());
+
+	UnlockReleaseBuffer(buffer);
+
+	END_CRIT_SECTION();
+
+	/* build the index */
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/*
+	 * create a temporary memory context that is reset once for each tuple
+	 * inserted into the index
+	 */
+	buildstate.tmpCtx = createTempGistContext();
+
+	/*
+	 * Do the heap scan.
+	 */
+	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
+								   gistBuildCallback, (void *) &buildstate);
+
+	/*
+	 * If this is a buffering build, do the final emptying of node buffers.
+	 */
+	if (buildstate.bufferingMode == GIST_BUFFERING_ACTIVE)
+	{
+		int			i;
+		GISTInsertState insertstate;
+		GISTNodeBuffer *nodeBuffer;
+		MemoryContext oldCtx;
+		GISTBuildBuffers *gfbb = buildstate.giststate.gfbb;
+
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+
+		memset(&insertstate, 0, sizeof(GISTInsertState));
+		insertstate.freespace = buildstate.freespace;
+		insertstate.r = index;
+
+		/*
+		 * Iterate through the levels, starting from the highest.
+		 */
+		for (i = gfbb->buffersOnLevelsCount - 1; i >= 0; i--)
+		{
+			bool		nonEmpty = true;
+
+			/*
+			 * While this level still has non-empty node buffers, iterate over
+			 * them and start emptying each non-empty one.
+			 */
+			while (nonEmpty)
+			{
+				ListCell   *p;
+
+				nonEmpty = false;
+
+				for (p = list_head(gfbb->buffersOnLevels[i]); p; p = p->next)
+				{
+					bool		isRoot;
+
+					/* Get next node buffer */
+					nodeBuffer = (GISTNodeBuffer *) p->data.ptr_value;
+					isRoot = (nodeBuffer->nodeBlocknum == GIST_ROOT_BLKNO);
+
+					/* Skip empty node buffer */
+					if (nodeBuffer->blocksCount == 0)
+						continue;
+
+					/* Memorize that we saw a non-empty buffer. */
+					nonEmpty = true;
+
+					/* Process emptying of node buffer */
+					MemoryContextSwitchTo(gfbb->context);
+					gfbb->bufferEmptyingQueue = lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
+					MemoryContextSwitchTo(buildstate.tmpCtx);
+					gistProcessEmptyingStack(&buildstate.giststate, &insertstate);
+
+					/*
+					 * Root page node buffer is the only node buffer that can
+					 * be deleted from the list. So, let's be careful and
+					 * restart the scan.
+					 */
+					if (isRoot)
+						break;
+				}
+			}
+		}
+		MemoryContextSwitchTo(oldCtx);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(buildstate.tmpCtx);
+
+	freeGISTstate(&buildstate.giststate);
+
+	/*
+	 * Return statistics
+	 */
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+	result->heap_tuples = reltuples;
+	result->index_tuples = (double) buildstate.indtuples;
+
+	PG_RETURN_POINTER(result);
+}
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
new file mode 100644
index 0000000..d0c124d
--- /dev/null
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -0,0 +1,910 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuildbuffers.c
+ *	  buffer management functions for the GiST buffering build algorithm.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuildbuffers.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "storage/buffile.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static GISTNodeBufferPage *gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb);
+static void gistAddLoadedBuffer(GISTBuildBuffers *gfbb, BlockNumber blocknum);
+static void gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+static void gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+static void gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple item);
+static void gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *item);
+static int	gistBuffersFreeBlocksCmp(const void *a, const void *b);
+static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
+static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+
+/*
+ * Initialize GiST buffering build data structure.
+ */
+void
+gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel)
+{
+	HASHCTL		hashCtl;
+
+	/*
+	 * Create a temporary file and initialize the data structures for free
+	 * page management.
+	 */
+	gfbb->pfile = BufFileCreateTemp(true);
+	gfbb->nFileBlocks = 0;
+	gfbb->nFreeBlocks = 0;
+	gfbb->blocksSorted = false;
+	gfbb->freeBlocksLen = 32;
+	gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
+
+	/*
+	 * Current memory context will be used for all in-memory data structures
+	 * of buffers which are persistent during buffering build.
+	 */
+	gfbb->context = CurrentMemoryContext;
+
+	/*
+	 * nodeBuffersTab hash is an association between index blocks and their
+	 * buffers.
+	 */
+	hashCtl.keysize = sizeof(BlockNumber);
+	hashCtl.entrysize = sizeof(GISTNodeBuffer);
+	hashCtl.hcxt = CurrentMemoryContext;
+	hashCtl.hash = tag_hash;
+	hashCtl.match = memcmp;
+	gfbb->nodeBuffersTab = hash_create("gistbuildbuffers",
+									   1024,
+									   &hashCtl,
+									 HASH_ELEM | HASH_CONTEXT | HASH_FUNCTION
+									   | HASH_COMPARE);
+
+	/*
+	 * Stack of node buffers which are planned for emptying.
+	 */
+	gfbb->bufferEmptyingQueue = NIL;
+
+	gfbb->currentEmptyingBufferBlockNumber = InvalidBlockNumber;
+	gfbb->currentEmptyingBufferSplit = false;
+
+	/*
+	 * Per-level lists of node buffers for the final buffer emptying process.
+	 * A node buffer is inserted here when it is created. The root node buffer
+	 * is the only buffer that can be deleted from its list, because after a
+	 * split the root node appears at a higher level but keeps its block number.
+	 */
+	gfbb->buffersOnLevelsLen = 16;
+	gfbb->buffersOnLevels = (List **) palloc(sizeof(List *) *
+											 gfbb->buffersOnLevelsLen);
+	gfbb->buffersOnLevelsCount = 0;
+
+	/*
+	 * Block numbers of node buffers whose last pages are currently loaded
+	 * into main memory.
+	 */
+	gfbb->loadedBuffersLen = 32;
+	gfbb->loadedBuffers = (BlockNumber *) palloc(gfbb->loadedBuffersLen *
+												 sizeof(BlockNumber));
+	gfbb->loadedBuffersCount = 0;
+
+	/*
+	 * Root path item of the tree. It is updated on each root node split.
+	 */
+	gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+							gfbb->context, sizeof(GISTBufferingInsertStack));
+	gfbb->rootitem->parent = NULL;
+	gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+	gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+	gfbb->rootitem->level = maxLevel;
+	gfbb->rootitem->refCount = 1;
+}
+
+/*
+ * Returns a node buffer by its block number. If the createNew flag is set,
+ * a new GISTNodeBuffer structure is created when none exists yet.
+ */
+GISTNodeBuffer *
+gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+				  BlockNumber nodeBlocknum,
+				  OffsetNumber downlinkoffnum,
+				  GISTBufferingInsertStack *parent, bool createNew)
+{
+	GISTNodeBuffer *nodeBuffer;
+	bool		found;
+
+	/*
+	 * Find nodeBuffer in hash table
+	 */
+	nodeBuffer = (GISTNodeBuffer *) hash_search(gfbb->nodeBuffersTab,
+												(const void *) &nodeBlocknum,
+										  createNew ? HASH_ENTER : HASH_FIND,
+												&found);
+	if (!found)
+	{
+		GISTBufferingInsertStack *path;
+		int			levelIndex;
+		int			i;
+		MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+		if (!createNew)		/* not found, and no new buffer was requested */
+		{
+			MemoryContextSwitchTo(oldcxt);	/* undo the context switch above */
+			return NULL;
+		}
+
+		if (nodeBlocknum != GIST_ROOT_BLKNO)
+		{
+			/*
+			 * For a non-root page we have to create a new path item which
+			 * references the given parent.
+			 */
+			path = (GISTBufferingInsertStack *) palloc(
+										   sizeof(GISTBufferingInsertStack));
+			path->parent = parent;
+			path->blkno = nodeBlocknum;
+			path->downlinkoffnum = downlinkoffnum;
+			path->level = parent->level - 1;
+			path->refCount = 0;
+			parent->refCount++;
+			Assert(path->level > 0);
+		}
+		else
+		{
+			path = gfbb->rootitem;
+		}
+
+		/* Node buffer references its path item. */
+		path->refCount++;
+
+		/*
+		 * New node buffer. Fill data structure with default values.
+		 */
+		nodeBuffer->pageBuffer = NULL;
+		nodeBuffer->blocksCount = 0;
+		nodeBuffer->path = path;
+		nodeBuffer->queuedForEmptying = false;
+
+		/*
+		 * Put the node buffer into the appropriate list. Calculate the index of
+		 * the node buffer list from its level.
+		 */
+		levelIndex = (path->level - gfbb->levelStep) / gfbb->levelStep;
+
+		/*
+		 * We may need to enlarge the array of per-level buffer lists.
+		 */
+		while (levelIndex >= gfbb->buffersOnLevelsLen)
+		{
+			gfbb->buffersOnLevelsLen *= 2;
+			gfbb->buffersOnLevels =
+				(List **) repalloc(gfbb->buffersOnLevels,
+								   gfbb->buffersOnLevelsLen *
+								   sizeof(List *));
+		}
+
+		/* Initialize new buffers lists as empty. */
+		if (levelIndex >= gfbb->buffersOnLevelsCount)
+		{
+			for (i = gfbb->buffersOnLevelsCount; i <= levelIndex; i++)
+				gfbb->buffersOnLevels[i] = NIL;
+			gfbb->buffersOnLevelsCount = levelIndex + 1;
+		}
+
+		/* Add node buffer to the corresponding list */
+		gfbb->buffersOnLevels[levelIndex] = lcons(
+							  nodeBuffer, gfbb->buffersOnLevels[levelIndex]);
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+	else
+	{
+		if (parent != nodeBuffer->path->parent)
+		{
+			/*
+			 * The caller provided a different parent path item than the one
+			 * we remembered. We trust the caller's parent to be more current;
+			 * the previous parent may be outdated by a page split.
+			 */
+			gistDecreasePathRefcount(nodeBuffer->path->parent);
+			nodeBuffer->path->parent = parent;
+			parent->refCount++;
+		}
+	}
+
+	return nodeBuffer;
+}
+
+/*
+ * Allocate memory for buffer page.
+ */
+static GISTNodeBufferPage *
+gistAllocateNewPageBuffer(GISTBuildBuffers * gfbb)
+{
+	GISTNodeBufferPage *pageBuffer;
+
+	/*
+	 * Allocate memory for page in appropriate context.
+	 */
+	pageBuffer = (GISTNodeBufferPage *) MemoryContextAlloc(gfbb->context, BLCKSZ);
+
+	/*
+	 * Set page free space
+	 */
+	PAGE_FREE_SPACE(pageBuffer) = BLCKSZ - BUFFER_PAGE_DATA_OFFSET;
+	return pageBuffer;
+}
+
+/*
+ * Add specified block number into loadedBuffers array.
+ */
+static void
+gistAddLoadedBuffer(GISTBuildBuffers * gfbb, BlockNumber blocknum)
+{
+	if (gfbb->loadedBuffersCount >= gfbb->loadedBuffersLen)
+	{
+		/*
+		 * Not enough memory is currently allocated; double the array.
+		 */
+		gfbb->loadedBuffersLen *= 2;
+		gfbb->loadedBuffers = (BlockNumber *) repalloc(gfbb->loadedBuffers,
+													 gfbb->loadedBuffersLen *
+													   sizeof(BlockNumber));
+	}
+	/* Actual add to array */
+	gfbb->loadedBuffers[gfbb->loadedBuffersCount] = blocknum;
+	gfbb->loadedBuffersCount++;
+}
+
+
+/*
+ * Load last page of node buffer into main memory.
+ */
+static void
+gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+{
+	/* Check if we really should load something */
+	if (!nodeBuffer->pageBuffer && nodeBuffer->blocksCount > 0)
+	{
+		/* Allocate memory for page */
+		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+
+		/* Read block from temporary file */
+		BufFileSeekBlock(gfbb->pfile, nodeBuffer->pageBlocknum);
+		BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Mark file block as free */
+		gistBuffersReleaseBlock(gfbb, nodeBuffer->pageBlocknum);
+
+		/* Mark node buffer as loaded */
+		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+		nodeBuffer->pageBlocknum = InvalidBlockNumber;
+	}
+}
+
+/*
+ * Write last page of node buffer to the disk.
+ */
+static void
+gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+{
+	/* Check if we have something to write */
+	if (nodeBuffer->pageBuffer)
+	{
+		BlockNumber blkno;
+
+		/* Get free file block */
+		blkno = gistBuffersGetFreeBlock(gfbb);
+
+		/* Write block to the temporary file */
+		BufFileSeekBlock(gfbb->pfile, blkno);
+		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Free memory of that page */
+		pfree(nodeBuffer->pageBuffer);
+		nodeBuffer->pageBuffer = NULL;
+
+		/* Save block number */
+		nodeBuffer->pageBlocknum = blkno;
+	}
+}
+
+/*
+ * Write last pages of all node buffers to the disk.
+ */
+void
+gistUnloadNodeBuffers(GISTBuildBuffers *gfbb)
+{
+	int			i;
+
+	/* Iterate over node buffers whose last page is loaded into main memory */
+	for (i = 0; i < gfbb->loadedBuffersCount; i++)
+	{
+		GISTNodeBuffer *nodeBuffer;
+		bool		found;
+
+		/* Find node buffer by its block number */
+		nodeBuffer = hash_search(gfbb->nodeBuffersTab, &gfbb->loadedBuffers[i],
+								 HASH_FIND, &found);
+
+		/*
+		 * The node buffer might not be found: it can disappear during a page
+		 * split. That's OK, just skip it.
+		 */
+		if (!found)
+			continue;
+
+		/* Unload last page to the disk */
+		gistUnloadNodeBuffer(gfbb, nodeBuffer);
+	}
+	/* Now there are no node buffers with loaded last page */
+	gfbb->loadedBuffersCount = 0;
+}
+
+/*
+ * Add index tuple to buffer page.
+ */
+static void
+gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple itup)
+{
+	/*
+	 * Get pointer to the start of page free space
+	 */
+	char	   *ptr = (char *) pageBuffer + BUFFER_PAGE_DATA_OFFSET
+	+ PAGE_FREE_SPACE(pageBuffer) - MAXALIGN(IndexTupleSize(itup));
+
+	/*
+	 * There should be enough space
+	 */
+	Assert(PAGE_FREE_SPACE(pageBuffer) >= MAXALIGN(IndexTupleSize(itup)));
+
+	/*
+	 * Reduce free space value of page
+	 */
+	PAGE_FREE_SPACE(pageBuffer) -= MAXALIGN(IndexTupleSize(itup));
+
+	/*
+	 * Copy index tuple to free space
+	 */
+	memcpy(ptr, itup, IndexTupleSize(itup));
+}
+
+/*
+ * Get last item from buffer page and remove it from page.
+ */
+static void
+gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *itup)
+{
+	/*
+	 * Get pointer to last index tuple
+	 */
+	IndexTuple	ptr = (IndexTuple) ((char *) pageBuffer
+									+ BUFFER_PAGE_DATA_OFFSET
+									+ PAGE_FREE_SPACE(pageBuffer));
+
+	/*
+	 * Page shouldn't be empty
+	 */
+	Assert(!PAGE_IS_EMPTY(pageBuffer));
+
+	/*
+	 * Allocate memory for returned index tuple copy
+	 */
+	*itup = (IndexTuple) palloc(IndexTupleSize(ptr));
+
+	/*
+	 * Copy data
+	 */
+	memcpy(*itup, ptr, IndexTupleSize(ptr));
+
+	/*
+	 * Increase free space value of page
+	 */
+	PAGE_FREE_SPACE(pageBuffer) += MAXALIGN(IndexTupleSize(*itup));
+}
+
+/*
+ * Push new index tuple to node buffer.
+ */
+void
+gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+						 IndexTuple itup)
+{
+	/*
+	 * Most memory operations will be in the buffering build persistent
+	 * context. So, let's switch to it.
+	 */
+	MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+	/* Is the buffer currently empty? */
+	if (nodeBuffer->blocksCount == 0)
+	{
+		/* It's empty, let's create the first page */
+		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+		nodeBuffer->pageBuffer->prev = InvalidBlockNumber;
+		nodeBuffer->blocksCount = 1;
+		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+	}
+
+	/* Load last page of node buffer if it wasn't already */
+	if (!nodeBuffer->pageBuffer)
+	{
+		gistLoadNodeBuffer(gfbb, nodeBuffer);
+	}
+
+	/*
+	 * Check if there is enough space on the last page for the tuple
+	 */
+	if (PAGE_NO_SPACE(nodeBuffer->pageBuffer, itup))
+	{
+		/*
+		 * Swap previous block to disk and allocate new one
+		 */
+		BlockNumber blkno;
+
+		/* Write filled page to the disk */
+		blkno = gistBuffersGetFreeBlock(gfbb);
+		BufFileSeekBlock(gfbb->pfile, blkno);
+		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Mark space of in-memory page as empty */
+		PAGE_FREE_SPACE(nodeBuffer->pageBuffer) =
+			BLCKSZ - MAXALIGN(offsetof(GISTNodeBufferPage, tupledata));
+
+		/* Save block number of the previous page */
+		nodeBuffer->pageBuffer->prev = blkno;
+
+		/* We've just added one more page */
+		nodeBuffer->blocksCount++;
+	}
+
+	gistPlaceItupToPage(nodeBuffer->pageBuffer, itup);
+
+	/*
+	 * Restore memory context
+	 */
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * Removes one index tuple from node buffer. Returns true if success and false
+ * if node buffer is empty.
+ */
+bool
+gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+						  IndexTuple *itup)
+{
+	/*
+	 * If node buffer is empty then return false.
+	 */
+	if (nodeBuffer->blocksCount <= 0)
+		return false;
+
+	/* Load last page of node buffer if needed */
+	if (!nodeBuffer->pageBuffer)
+		gistLoadNodeBuffer(gfbb, nodeBuffer);
+
+	/*
+	 * Get index tuple from last non-empty page.
+	 */
+	gistGetItupFromPage(nodeBuffer->pageBuffer, itup);
+
+	/*
+	 * Check if the page the index tuple was taken from is now empty
+	 */
+	if (PAGE_IS_EMPTY(nodeBuffer->pageBuffer))
+	{
+		BlockNumber prevblkno;
+
+		/*
+		 * If it's empty then we need to release buffer file block and free
+		 * page buffer.
+		 */
+		nodeBuffer->blocksCount--;
+
+		/*
+		 * If there are more pages, fetch the previous one
+		 */
+		prevblkno = nodeBuffer->pageBuffer->prev;
+		if (prevblkno != InvalidBlockNumber)
+		{
+			/* There actually is previous page, so read it. */
+			Assert(nodeBuffer->blocksCount > 0);
+			BufFileSeekBlock(gfbb->pfile, prevblkno);
+			BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+			/* Mark block as free */
+			gistBuffersReleaseBlock(gfbb, prevblkno);
+		}
+		else
+		{
+			/* Actually there are no more pages. Free memory. */
+			Assert(nodeBuffer->blocksCount == 0);
+			pfree(nodeBuffer->pageBuffer);
+			nodeBuffer->pageBuffer = NULL;
+		}
+	}
+	return true;
+}
+
+/*
+ * qsort comparator for sorting freeBlocks[] into decreasing order.
+ */
+static int
+gistBuffersFreeBlocksCmp(const void *a, const void *b)
+{
+	long		ablk = *((const long *) a);
+	long		bblk = *((const long *) b);
+
+	/*
+	 * can't just subtract because long might be wider than int
+	 */
+	if (ablk < bblk)
+		return 1;
+	if (ablk > bblk)
+		return -1;
+	return 0;
+}
+
+/*
+ * Select a currently unused block for writing to.
+ *
+ * NB: should only be called when writer is ready to write immediately,
+ * to ensure that first write pass is sequential.
+ */
+static long
+gistBuffersGetFreeBlock(GISTBuildBuffers * gfbb)
+{
+	/*
+	 * If there are multiple free blocks, we select the one appearing last in
+	 * freeBlocks[] (after sorting the array if needed).  If there are none,
+	 * assign the next block at the end of the file.
+	 */
+	if (gfbb->nFreeBlocks > 0)
+	{
+		if (!gfbb->blocksSorted)
+		{
+			qsort((void *) gfbb->freeBlocks, gfbb->nFreeBlocks,
+				  sizeof(long), gistBuffersFreeBlocksCmp);
+			gfbb->blocksSorted = true;
+		}
+		return gfbb->freeBlocks[--gfbb->nFreeBlocks];
+	}
+	else
+		return gfbb->nFileBlocks++;
+}
+
+/*
+ * Return a block# to the freelist.
+ */
+static void
+gistBuffersReleaseBlock(GISTBuildBuffers * gfbb, long blocknum)
+{
+	int			ndx;
+
+	/*
+	 * Enlarge freeBlocks array if full.
+	 */
+	if (gfbb->nFreeBlocks >= gfbb->freeBlocksLen)
+	{
+		gfbb->freeBlocksLen *= 2;
+		gfbb->freeBlocks = (long *) repalloc(gfbb->freeBlocks,
+											 gfbb->freeBlocksLen *
+											 sizeof(long));
+	}
+
+	/*
+	 * Add blocknum to array, and mark the array unsorted if it's no longer in
+	 * decreasing order.
+	 */
+	ndx = gfbb->nFreeBlocks++;
+	gfbb->freeBlocks[ndx] = blocknum;
+	if (ndx > 0 && gfbb->freeBlocks[ndx - 1] < blocknum)
+		gfbb->blocksSorted = false;
+}
+
+/*
+ * Free buffering build data structure.
+ */
+void
+gistFreeBuildBuffers(GISTBuildBuffers * gfbb)
+{
+	/* Close buffers file. */
+	BufFileClose(gfbb->pfile);
+
+	/* Everything else will be freed on memory context release */
+}
+
+/*
+ * Data structure representing information about a node buffer used for
+ * relocating index tuples from a split node buffer.
+ */
+typedef struct
+{
+	GISTENTRY	entry[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	SplitedPageLayout *dist;
+	GISTNodeBuffer *nodeBuffer;
+} RelocationBufferInfo;
+
+/*
+ * Maintain data structures on page split.
+ */
+void
+gistRelocateBuildBuffersOnSplit(GISTBuildBuffers * gfbb, GISTSTATE *giststate,
+								Relation r, GISTBufferingInsertStack *path,
+								Buffer buffer, SplitedPageLayout *dist)
+{
+	RelocationBufferInfo *relocationBuffersInfos;
+	bool		found;
+	GISTNodeBuffer *nodeBuffer;
+	BlockNumber blocknum;
+	IndexTuple	itup;
+	int			splitPagesCount = 0,
+				i;
+	GISTENTRY	entry[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	SplitedPageLayout *ptr;
+	int			level = path->level;
+	GISTNodeBuffer nodebuf;
+
+	blocknum = BufferGetBlockNumber(buffer);
+
+	/*
+	 * If this is a root split, update the root path item kept in memory. This
+	 * ensures that all path stacks are always complete, including all parent
+	 * nodes up to the root, which simplifies the algorithm to re-find correct
+	 * parent.
+	 */
+	if (blocknum == GIST_ROOT_BLKNO)
+	{
+		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
+
+		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+			gfbb->context, sizeof(GISTBufferingInsertStack));
+		gfbb->rootitem->parent = NULL;
+		gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+		gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+		gfbb->rootitem->level = oldroot->level + 1;
+		gfbb->rootitem->refCount = 1;
+
+		oldroot->parent = gfbb->rootitem;
+		oldroot->blkno = dist->block.blkno;
+		oldroot->downlinkoffnum = InvalidOffsetNumber;
+	}
+
+	/*
+	 * If the level of the split page doesn't have buffers, then there is
+	 * nothing to do.
+	 */
+	if (!LEVEL_HAS_BUFFERS(level, gfbb))
+		return;
+
+	/*
+	 * Get the node buffer of the split page.
+	 */
+	nodeBuffer = hash_search(gfbb->nodeBuffersTab, &blocknum,
+							 HASH_FIND, &found);
+	if (!found)
+	{
+		/*
+		 * The node buffer should already exist at this point, created either
+		 * by index tuple insertion or by a page split.
+		 */
+		elog(ERROR,
+		"node buffer of splitting page (%u) doesn't exist while it should",
+			 blocknum);
+	}
+
+	/*
+	 * Make a copy of the old buffer, as we're going to reuse the old one for
+	 * the buffer for the new left page, which is on the same block as the
+	 * old page. That's not true for the root page, but that's fine because
+	 * we never have a buffer on the root page anyway. The original algorithm
+	 * as described by Arge et al did, but it doesn't help as you might as
+	 * well read the tuples straight from the heap instead of the root buffer.
+	 */
+	Assert(blocknum != GIST_ROOT_BLKNO);
+	memcpy(&nodebuf, nodeBuffer, sizeof(GISTNodeBuffer));
+
+	/* Reassign pointer to the saved copy. */
+	nodeBuffer = &nodebuf;
+
+	/*
+	 * Count the pages produced by the split; this sizes the relocation
+	 * info array allocated below.
+	 */
+	for (ptr = dist; ptr; ptr = ptr->next)
+		splitPagesCount++;
+
+	/*
+	 * Allocate memory for information about relocation buffers.
+	 */
+	relocationBuffersInfos =
+		(RelocationBufferInfo *) palloc(sizeof(RelocationBufferInfo) *
+										splitPagesCount);
+
+	/*
+	 * Fill relocation buffers information for node buffers of pages produced
+	 * by split.
+	 */
+	i = 0;
+	for (ptr = dist; ptr; ptr = ptr->next)
+	{
+		GISTNodeBuffer *newNodeBuffer;
+
+		/*
+		 * Decompress parent index tuple of node buffer page.
+		 */
+		gistDeCompressAtt(giststate, r,
+						  ptr->itup, NULL, (OffsetNumber) 0,
+						  relocationBuffersInfos[i].entry,
+						  relocationBuffersInfos[i].isnull);
+
+		newNodeBuffer = gistGetNodeBuffer(gfbb, giststate, ptr->block.blkno,
+								   path->downlinkoffnum, path->parent, true);
+
+		/*
+		 * Fill relocation information
+		 */
+		relocationBuffersInfos[i].nodeBuffer = newNodeBuffer;
+		if (newNodeBuffer->nodeBlocknum == blocknum)
+		{
+			/*
+			 * Reuse the GISTNodeBuffer data structure of the split node. The
+			 * old version was copied.
+			 */
+			newNodeBuffer->blocksCount = 0;
+			newNodeBuffer->pageBuffer = NULL;
+			newNodeBuffer->pageBlocknum = InvalidBlockNumber;
+		}
+
+		/*
+		 * Fill node buffer structure
+		 */
+		relocationBuffersInfos[i].dist = ptr;
+
+		i++;
+	}
+
+	/*
+	 * Loop of index tuples relocation.
+	 */
+	while (gistPopItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
+	{
+		float		sum_grow,
+					which_grow[INDEX_MAX_KEYS];
+		int			i,
+					which;
+		IndexTuple	newtup;
+
+		/*
+		 * Choose node buffer for index tuple insert.
+		 */
+		gistDeCompressAtt(giststate, r,
+						  itup, NULL, (OffsetNumber) 0, entry, isnull);
+
+		which = -1;
+		*which_grow = -1.0f;
+		sum_grow = 1.0f;
+
+		for (i = 0; i < splitPagesCount && sum_grow; i++)
+		{
+			int			j;
+			RelocationBufferInfo *splitPageInfo = &relocationBuffersInfos[i];
+
+			sum_grow = 0.0f;
+			for (j = 0; j < r->rd_att->natts; j++)
+			{
+				float		usize;
+
+				usize = gistpenalty(giststate, j,
+									&splitPageInfo->entry[j],
+									splitPageInfo->isnull[j],
+									&entry[j], isnull[j]);
+
+				if (which_grow[j] < 0 || usize < which_grow[j])
+				{
+					which = i;
+					which_grow[j] = usize;
+					if (j < r->rd_att->natts - 1 && i == 0)
+						which_grow[j + 1] = -1;
+					sum_grow += which_grow[j];
+				}
+				else if (which_grow[j] == usize)
+					sum_grow += usize;
+				else
+				{
+					sum_grow = 1;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * push item to selected node buffer
+		 */
+		gistPushItupToNodeBuffer(gfbb, relocationBuffersInfos[which].nodeBuffer,
+								 itup);
+
+		/*
+		 * If the node buffer has just become more than half full, add it to
+		 * the emptying stack.
+		 */
+		if (BUFFER_HALF_FILLED(relocationBuffersInfos[which].nodeBuffer, gfbb)
+			&& !relocationBuffersInfos[which].nodeBuffer->queuedForEmptying)
+		{
+			MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+			relocationBuffersInfos[which].nodeBuffer->queuedForEmptying = true;
+			gfbb->bufferEmptyingQueue =
+				lcons(relocationBuffersInfos[which].nodeBuffer,
+					  gfbb->bufferEmptyingQueue);
+			MemoryContextSwitchTo(oldcxt);
+		}
+
+		/*
+		 * adjust tuple of parent page
+		 */
+		newtup = gistgetadjusted(r, relocationBuffersInfos[which].dist->itup,
+								 itup, giststate);
+		if (newtup)
+		{
+			/*
+			 * Parent page index tuple expands. We need to update old index
+			 * tuple with the new one.
+			 */
+			gistDeCompressAtt(giststate, r,
+							  newtup, NULL, (OffsetNumber) 0,
+							  relocationBuffersInfos[which].entry,
+							  relocationBuffersInfos[which].isnull);
+
+			relocationBuffersInfos[which].dist->itup = newtup;
+		}
+	}
+
+	/* Report about splitting for current emptying buffer */
+	if (blocknum == gfbb->currentEmptyingBufferBlockNumber)
+		gfbb->currentEmptyingBufferSplit = true;
+
+	pfree(relocationBuffersInfos);
+}
+
+/*
+ * Return size of node buffer occupied by stored index tuples.
+ */
+int
+gistGetNodeBufferBusySize(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+{
+	int			size;
+
+	/*
+	 * No occupied buffer file blocks means that node buffer is empty.
+	 */
+	if (nodeBuffer->blocksCount == 0)
+		return 0;
+	if (!nodeBuffer->pageBuffer)
+		gistLoadNodeBuffer(gfbb, nodeBuffer);
+
+	/*
+	 * We assume only the last page to be not fully filled.
+	 */
+	size = (BLCKSZ - BUFFER_PAGE_DATA_OFFSET) * nodeBuffer->blocksCount;
+	size -= PAGE_FREE_SPACE(nodeBuffer->pageBuffer);
+	return size;
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 1754a10..bae990b 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -670,13 +670,30 @@ gistoptions(PG_FUNCTION_ARGS)
 {
 	Datum		reloptions = PG_GETARG_DATUM(0);
 	bool		validate = PG_GETARG_BOOL(1);
-	bytea	   *result;
+	relopt_value *options;
+	GiSTOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
+		{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
+	};
 
-	result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		PG_RETURN_NULL();
+
+	rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+
+	PG_RETURN_BYTEA_P(rdopts);
 
-	if (result)
-		PG_RETURN_BYTEA_P(result);
-	PG_RETURN_NULL();
 }
 
 /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..9cf4875 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -266,7 +266,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			else
 				GistPageGetOpaque(page)->rightlink = xldata->origrlink;
 			GistPageGetOpaque(page)->nsn = xldata->orignsn;
-			if (i < xlrec.data->npage - 1 && !isrootsplit)
+			if (i < xlrec.data->npage - 1 && !isrootsplit &&
+				!xldata->noFollowRight)
 				GistMarkFollowRight(page);
 			else
 				GistClearFollowRight(page);
@@ -414,7 +415,7 @@ XLogRecPtr
 gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 			  SplitedPageLayout *dist,
 			  BlockNumber origrlink, GistNSN orignsn,
-			  Buffer leftchildbuf)
+			  Buffer leftchildbuf, bool noFollowRight)
 {
 	XLogRecData *rdata;
 	gistxlogPageSplit xlrec;
@@ -436,6 +437,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 	xlrec.npage = (uint16) npage;
 	xlrec.leftchild =
 		BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
+	xlrec.noFollowRight = noFollowRight;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = sizeof(gistxlogPageSplit);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 9fb20a6..a0e41b4 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -17,13 +17,56 @@
 #include "access/gist.h"
 #include "access/itup.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "utils/rbtree.h"
+#include "utils/hsearch.h"
+
+/* Does the specified level have buffers? */
+#define LEVEL_HAS_BUFFERS(nlevel, gfbb) ((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && (nlevel) != (gfbb)->rootitem->level)
+/* Is the specified buffer at least half-filled (should be queued for emptying)? */
+#define BUFFER_HALF_FILLED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
+/* Is the specified buffer overflowed (can't take any more index tuples)? */
+#define BUFFER_OVERFLOWED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
 
 /* Buffer lock modes */
 #define GIST_SHARE	BUFFER_LOCK_SHARE
 #define GIST_EXCLUSIVE	BUFFER_LOCK_EXCLUSIVE
 #define GIST_UNLOCK BUFFER_LOCK_UNLOCK
 
+typedef struct
+{
+	BlockNumber prev;
+	uint32		freespace;
+	char		tupledata[1];
+} GISTNodeBufferPage;
+
+#define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
+/* Returns free space in node buffer page */
+#define PAGE_FREE_SPACE(nbp) ((nbp)->freespace)
+/* Checks if node buffer page is empty */
+#define PAGE_IS_EMPTY(nbp) ((nbp)->freespace == BLCKSZ - BUFFER_PAGE_DATA_OFFSET)
+/* Checks if node buffer page doesn't have sufficient space for index tuple */
+#define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
+										MAXALIGN(IndexTupleSize(itup)))
+
+/* Buffer of tree node data structure */
+typedef struct
+{
+	/* number of page containing node */
+	BlockNumber nodeBlocknum;
+
+	/* count of blocks occupied by buffer */
+	int32		blocksCount;
+
+	BlockNumber pageBlocknum;
+	GISTNodeBufferPage *pageBuffer;
+
+	/* is this buffer queued for emptying? */
+	bool		queuedForEmptying;
+
+	struct GISTBufferingInsertStack *path;
+} GISTNodeBuffer;
+
 /*
  * GISTSTATE: information needed for any GiST index operation
  *
@@ -44,6 +87,8 @@ typedef struct GISTSTATE
 	/* Collations to pass to the support functions */
 	Oid			supportCollation[INDEX_MAX_KEYS];
 
+	struct GISTBuildBuffers *gfbb;
+
 	TupleDesc	tupdesc;
 } GISTSTATE;
 
@@ -170,6 +215,7 @@ typedef struct gistxlogPageSplit
 
 	BlockNumber leftchild;		/* like in gistxlogPageUpdate */
 	uint16		npage;			/* # of pages in the split */
+	bool		noFollowRight;	/* skip followRight flag setting */
 
 	/*
 	 * follow: 1. gistxlogPage and array of IndexTupleData per page
@@ -225,6 +271,76 @@ typedef struct GISTInsertStack
 	struct GISTInsertStack *parent;
 } GISTInsertStack;
 
+/*
+ * Extended GISTInsertStack for buffering GiST index build. It additionally
+ * holds the level number of the page.
+ */
+typedef struct GISTBufferingInsertStack
+{
+	/* current page */
+	BlockNumber blkno;
+
+	/* offset of the downlink in the parent page, that points to this page */
+	OffsetNumber downlinkoffnum;
+
+	/* pointer to parent */
+	struct GISTBufferingInsertStack *parent;
+
+	int			refCount;
+
+	/* level number */
+	int			level;
+}	GISTBufferingInsertStack;
+
+/*
+ * Data structure with general information about build buffers.
+ */
+typedef struct GISTBuildBuffers
+{
+	/* memory context which is persistent during buffering build */
+	MemoryContext context;
+	/* underlying files */
+	BufFile    *pfile;
+	/* # of blocks used in underlying files */
+	long		nFileBlocks;
+	/* is freeBlocks[] currently in order? */
+	bool		blocksSorted;
+	/* resizable array of free blocks */
+	long	   *freeBlocks;
+	/* # of currently free blocks */
+	int			nFreeBlocks;
+	/* current allocated length of freeBlocks[] */
+	int			freeBlocksLen;
+
+	/* hash for buffers by block number */
+	HTAB	   *nodeBuffersTab;
+
+	/* stack of buffers for emptying */
+	List	   *bufferEmptyingQueue;
+	/* block number of the buffer currently being emptied */
+	BlockNumber currentEmptyingBufferBlockNumber;
+	/* whether currently emptying buffer was split - a signal to stop emptying */
+	bool		currentEmptyingBufferSplit;
+
+	/* step of levels for buffers location */
+	int			levelStep;
+	/* maximal number of pages occupied by buffer */
+	int			pagesPerBuffer;
+
+	/* array of lists of non-empty buffers on levels for final emptying */
+	List	  **buffersOnLevels;
+	int			buffersOnLevelsLen;
+	int			buffersOnLevelsCount;
+
+	/* dynamic array of block numbers of buffers loaded into main memory */
+	BlockNumber *loadedBuffers;
+	/* number of block numbers */
+	int			loadedBuffersCount;
+	/* length of array */
+	int			loadedBuffersLen;
+	GISTBufferingInsertStack *rootitem;
+}	GISTBuildBuffers;
+
 typedef struct GistSplitVector
 {
 	GIST_SPLITVEC splitVector;	/* to/from PickSplit method */
@@ -286,6 +402,17 @@ extern Datum gistinsert(PG_FUNCTION_ARGS);
 extern MemoryContext createTempGistContext(void);
 extern void initGISTstate(GISTSTATE *giststate, Relation index);
 extern void freeGISTstate(GISTSTATE *giststate);
+void gistdoinsert(Relation r,
+			 IndexTuple itup,
+			 Size freespace,
+			 GISTSTATE *GISTstate);
+bool gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
+				Buffer buffer,
+				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+				Buffer leftchildbuf,
+				List **splitinfo,
+				GISTBufferingInsertStack * path);
+void		gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r, GISTBufferingInsertStack * child);
 
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
@@ -305,7 +432,7 @@ extern XLogRecPtr gistXLogSplit(RelFileNode node,
 			  BlockNumber blkno, bool page_is_leaf,
 			  SplitedPageLayout *dist,
 			  BlockNumber origrlink, GistNSN oldnsn,
-			  Buffer leftchild);
+			  Buffer leftchild, bool noFollowRight);
 
 /* gistget.c */
 extern Datum gistgettuple(PG_FUNCTION_ARGS);
@@ -313,6 +440,16 @@ extern Datum gistgetbitmap(PG_FUNCTION_ARGS);
 
 /* gistutil.c */
 
+/*
+ * Storage type for GiST's reloptions
+ */
+typedef struct GiSTOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;		/* page fill factor in percent (0..100) */
+	int			bufferingModeOffset;	/* use buffering build? */
+}	GiSTOptions;
+
 #define GiSTPageSize   \
 	( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) )
 
@@ -380,4 +517,25 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
 			   GistSplitVector *v, GistEntryVector *entryvec,
 			   int attno);
 
+/* gistbuild.c */
+extern void gistDecreasePathRefcount(GISTBufferingInsertStack * path);
+
+/* gistbuildbuffers.c */
+extern void gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel);
+GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+				  BlockNumber blkno, OffsetNumber downlinkoffnum,
+				  GISTBufferingInsertStack * parent, bool createNew);
+extern void gistPushItupToNodeBuffer(GISTBuildBuffers * gfbb,
+						 GISTNodeBuffer *nodeBuffer, IndexTuple item);
+extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers * gfbb,
+						  GISTNodeBuffer *nodeBuffer, IndexTuple *item);
+extern void gistFreeBuildBuffers(GISTBuildBuffers * gfbb);
+extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
+								GISTSTATE *giststate, Relation r,
+							  GISTBufferingInsertStack *path, Buffer buffer,
+								SplitedPageLayout *dist);
+extern int gistGetNodeBufferBusySize(GISTBuildBuffers *gfbb,
+						  GISTNodeBuffer *nodeBuffer);
+extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
+
 #endif   /* GIST_PRIVATE_H */
#111Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#110)
Re: WIP: Fast GiST index build

On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 16.08.2011 22:10, Heikki Linnakangas wrote:

Here's a version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...

Thanks a lot. I'm going to start rerunning the tests now.

------
With best regards,
Alexander Korotkov.

#112Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#111)
1 attachment(s)
Re: WIP: Fast GiST index build

On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 16.08.2011 22:10, Heikki Linnakangas wrote:

Here's a version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...

Thanks a lot. I'm going to start rerunning the tests now.

The first batch of test results will be available soon (running the tests and
processing the results take some time). Meanwhile, here is a patch with a few
small bugfixes.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.14.2.patch.gz (application/x-gzip)
#113Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#112)
Re: WIP: Fast GiST index build

I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
These are not all the results I planned for the first batch, because the tests
take more time than I expected.
Some notes about them.

Now I see two factors that speed up the regular build of GiST indexes:
1) As was noted before, a regular index build on a fairly well-ordered dataset
is fast.
2) I found that a worse index is faster to build. By a worse index I mean an
index with higher overlap. The function gistchoose selects the first index
tuple with zero penalty, if any. Thus, with higher overlap in the root page,
only a few of its index tuples will be chosen for inserts. And, recursively,
only a small part of the tree will be used for actual inserts. That part of the
tree can fit into the cache more easily. Thus, high overlap makes inserts
cheaper to the same degree that it makes searches more expensive.

In the tests on the first version of the patch I found the index quality of the
regular build much better than that of the buffering build (without neighbor
relocation). Now it's similar, but only because the index quality of the regular
index build became worse. As a result, in the current tests the regular index
build is faster than in the previous ones. I see the following possible causes:
1) I didn't save the source random data, so now it's new random data.
2) Some environment parameters of my test setup may have changed, though I
doubt it.
Despite these possible explanations, it seems quite strange to me.

In order to compare index build methods on higher-quality indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). With
it, search on the uniform dataset is about 10 times faster! And, as expected,
the regular index build becomes much slower. It has been running for more than
60 hours and only 50% of the index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests, though it's hard for me to explain why.

------
With best regards,
Alexander Korotkov.

#114Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#113)
Re: WIP: Fast GiST index build

On 24.08.2011 16:57, Alexander Korotkov wrote:

I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
These are not all the results I planned for the first batch, because the tests
take more time than I expected.
Some notes about them.

Now I see two factors that speed up the regular build of GiST indexes:
1) As was noted before, a regular index build on a fairly well-ordered dataset
is fast.
2) I found that a worse index is faster to build. By a worse index I mean an
index with higher overlap. The function gistchoose selects the first index
tuple with zero penalty, if any. Thus, with higher overlap in the root page,
only a few of its index tuples will be chosen for inserts. And, recursively,
only a small part of the tree will be used for actual inserts. That part of the
tree can fit into the cache more easily. Thus, high overlap makes inserts
cheaper to the same degree that it makes searches more expensive.

As an extreme case, a trivial penalty function that just always returns
0 will make index build fast - but the index will be useless for querying.

In the tests on the first version of the patch I found the index quality of the
regular build much better than that of the buffering build (without neighbor
relocation). Now it's similar, but only because the index quality of the regular
index build became worse. As a result, in the current tests the regular index
build is faster than in the previous ones. I see the following possible causes:
1) I didn't save the source random data, so now it's new random data.
2) Some environment parameters of my test setup may have changed, though I
doubt it.
Despite these possible explanations, it seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't
expect a big difference in the index quality of one random data set over
another. If the index quality depends so much on, say, the distribution
of the first few tuples that are inserted into it, that's quite an
interesting finding on its own, and merits some further research.

In order to compare index build methods on higher-quality indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). On a
uniform dataset, search is about 10 times faster! And, as expected, the
regular index build becomes much slower. It has been running for more than 60
hours, and only 50% of the index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests, though it's hard for me to explain why.

Hmm, makes me a bit uneasy that we're testing with a modified page
splitting algorithm. But if the new algorithm is that good, could you
post that as a separate patch, please?

That said, I don't see any new evidence that the buffering build
algorithm would be significantly worse. There's the case of ordered data
that we already knew about, and will have to just accept for now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#115Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#112)
1 attachment(s)
Re: WIP: Fast GiST index build

On 22.08.2011 13:23, Alexander Korotkov wrote:

On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

On 16.08.2011 22:10, Heikki Linnakangas wrote:

Here's a version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...

Thanks a lot. I'm going to start rerunning the tests now.

The first batch of test results will be available soon (running the tests and
processing the results takes some time). Meanwhile, here is a patch with a few
small bugfixes.

I've been mulling this through, and will continue working on this
tomorrow, but wanted to share this version meanwhile:

* Moved all the buffering build logic from gistplacetopage() to a new
function in gistbuild.c. There are almost no changes to gistplacetopage()
now; it returns the SplitInfo struct as usual, and the new function
deals with that and handles the call to
gistRelocateBuildBuffersOnSplit(), and the recursion to insert downlinks.

* Simplified the handling of buffersOnLevels lists a bit. There's now an
entry in the buffersOnLevels array for all levels, even those that don't
have buffers because levelStep > 1. That wastes a few bytes in the
array, but it's easier to debug and understand that way. Also,
there are no separate Len and Count variables for it anymore.

* Moved validateBufferingOption() to gistbuild.c

* Moved the code to add buffer to emptying queue to
gistPushItupToNodeBuffer() (was handled by the callers previously)

* Removed gistGetNodeBufferBusySize(), it was unused

* A lot of comment changes
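
As a rough illustration of the buffersOnLevels point above, the sketch below
applies the rule from the README in the attached patch: leaf pages (level 0)
never have buffers, and internal levels have them only when the level number is
a multiple of the level step. The helper names level_has_buffers() and
count_buffered_levels() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does a page on the given level carry a buffer?
 * Level 0 is the leaf level, which never has buffers.
 */
static bool
level_has_buffers(int level, int level_step)
{
	assert(level >= 0 && level_step >= 1);
	return level != 0 && level % level_step == 0;
}

/*
 * Count the buffered levels in a tree whose topmost level is max_level.
 * An array indexed directly by level needs max_level + 1 slots, but only
 * this many of them would ever hold buffers.
 */
static int
count_buffered_levels(int max_level, int level_step)
{
	int			level;
	int			count = 0;

	for (level = 0; level <= max_level; level++)
	{
		if (level_has_buffers(level, level_step))
			count++;
	}
	return count;
}
```

For a tree whose root is on level 3 and a level step of 2, only level 2 is
buffered, so three of the four slots in a per-level array stay empty. That is
the few wasted bytes traded for simpler indexing.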

Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a
test server.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist_fast_build-0.14.2-heikki-1.patch (text/x-diff)
*** a/doc/src/sgml/gist.sgml
--- b/doc/src/sgml/gist.sgml
***************
*** 642,647 **** my_distance(PG_FUNCTION_ARGS)
--- 642,679 ----
  
    </variablelist>
  
+  <sect2 id="gist-buffering-build">
+   <title>GiST buffering build</title>
+   <para>
+    Building large GiST indexes by simply inserting all the tuples tends to be
+    slow, because if the index tuples are scattered across the index and the
+    index is large enough to not fit in cache, the insertions need to perform
+    a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
+    method to build GiST indexes based on buffering, which can dramatically
+    reduce the amount of random I/O needed for non-ordered data sets. For
+    well-ordered datasets the benefit is smaller or non-existent, because
+    only a small number of pages receive new tuples at a time, and those pages
+    fit in cache even if the index as a whole does not.
+   </para>
+ 
+   <para>
+    However, buffering index build needs to call the <function>penalty</>
+    function more often, which consumes some extra CPU resources. Also, it can
+    influence the quality of the produced index, in both positive and negative
+    directions. That influence depends on various factors, like the
+    distribution of the input data and operator class implementation.
+   </para>
+ 
+   <para>
+    By default, the index build switches to the buffering method when the
+    index size reaches <xref linkend="guc-effective-cache-size">. It can
+    be manually turned on or off by the <literal>BUFFERING</literal> parameter
+    to the CREATE INDEX command. The default behavior is good for most cases,
+    but turning buffering off might speed up the build somewhat if the input
+    data is ordered.
+   </para>
+ 
+  </sect2>
  </sect1>
  
  <sect1 id="gist-examples">
*** a/doc/src/sgml/ref/create_index.sgml
--- b/doc/src/sgml/ref/create_index.sgml
***************
*** 341,346 **** CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
--- 341,366 ----
     </varlistentry>
  
     </variablelist>
+    <para>
+     GiST indexes additionally accept the following parameters:
+    </para>
+ 
+    <variablelist>
+ 
+    <varlistentry>
+     <term><literal>BUFFERING</></term>
+     <listitem>
+     <para>
+      Determines whether the buffering build technique described in
+      <xref linkend="gist-buffering-build"> is used to build the index. With
+      <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+      with <literal>AUTO</> it is initially disabled, but turned on
+      on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">. The default is <literal>AUTO</>.
+     </para>
+     </listitem>
+    </varlistentry>
+ 
+    </variablelist>
    </refsect2>
  
    <refsect2 id="SQL-CREATEINDEX-CONCURRENTLY">
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 219,224 **** static relopt_real realRelOpts[] =
--- 219,235 ----
  
  static relopt_string stringRelOpts[] =
  {
+ 	{
+ 		{
+ 			"buffering",
+ 			"Enables buffering build for this GiST index",
+ 			RELOPT_KIND_GIST
+ 		},
+ 		4,
+ 		false,
+ 		gistValidateBufferingOption,
+ 		"auto"
+ 	},
  	/* list terminator */
  	{{NULL}}
  };
*** a/src/backend/access/gist/Makefile
--- b/src/backend/access/gist/Makefile
***************
*** 13,18 **** top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
  OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
!        gistproc.o gistsplit.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 13,18 ----
  include $(top_builddir)/src/Makefile.global
  
  OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
!        gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
  
  include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/gist/README
--- b/src/backend/access/gist/README
***************
*** 24,29 **** The current implementation of GiST supports:
--- 24,30 ----
    * provides NULL-safe interface to GiST core
    * Concurrency
    * Recovery support via WAL logging
+   * Buffering build algorithm
  
  The support for concurrency implemented in PostgreSQL was developed based on
  the paper "Access Methods for Next-Generation Database Systems" by
***************
*** 31,36 **** Marcel Kornaker:
--- 32,43 ----
  
      http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
  
+ Buffering build algorithm for GiST was developed based on the paper "Efficient
+ Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
+ and Jeffrey Scott Vitter.
+ 
+     http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+ 
  The original algorithms were modified in several ways:
  
  * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
***************
*** 278,283 **** would complicate the insertion algorithm. So when an insertion sees a page
--- 285,418 ----
  with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
  crashed in the middle to completion by adding the downlink in the parent.
  
+ Buffering build algorithm
+ -------------------------
+ 
+ In the buffering index build algorithm, some or all internal nodes have a
+ buffer attached to them. When a tuple is inserted at the top, the descent down
+ the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
+ the buffer. When a buffer gets too full, all the tuples in it are flushed to
+ the lower level, where they again hit lower level buffers or leaf pages. This
+ makes the insertions happen in more of a breadth-first than depth-first order,
+ which greatly reduces the amount of random I/O required.
+ 
+ In the algorithm, levels are numbered so that leaf pages have level zero,
+ and internal node levels count up from 1. This numbering ensures that a page's
+ level number never changes, even when the root page is split.
+ 
+ Level                    Tree
+ 
+ 3                         *
+                       /       \
+ 2                *                 *
+               /  |  \           /  |  \
+ 1          *     *     *     *     *     *
+           / \   / \   / \   / \   / \   / \
+ 0        o   o o   o o   o o   o o   o o   o
+ 
+ * - internal page
+ o - leaf page
+ 
+ Internal pages that belong to certain levels have buffers associated with
+ them. Leaf pages never have buffers. Which levels have buffers is controlled
+ by "level step" parameter: level numbers that are multiples of level_step
+ have buffers, while others do not. For example, if level_step = 2, then
+ pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
+ internal page has a buffer.
+ 
+ Level        Tree (level_step = 1)                Tree (level_step = 2)
+ 
+ 3                      *(b)                                  *
+                    /       \                             /       \
+ 2             *(b)              *(b)                *(b)              *(b)
+            /  |  \           /  |  \             /  |  \           /  |  \
+ 1       *(b)  *(b)  *(b)  *(b)  *(b)  *(b)    *     *     *     *     *     *
+        / \   / \   / \   / \   / \   / \     / \   / \   / \   / \   / \   / \
+ 0     o   o o   o o   o o   o o   o o   o   o   o o   o o   o o   o o   o o   o
+ 
+ (b) - buffer
+ 
+ Logically, a buffer is just a bunch of tuples. Physically, it is divided in
+ pages, backed by a temporary file. Each buffer can be in one of two states:
+ a) Last page of the buffer is kept in main memory. A node buffer is
+ automatically switched to this state when a new index tuple is added to it,
+ or a tuple is removed from it.
+ b) All pages of the buffer are swapped out to disk. When a buffer becomes too
+ full, and we start to flush it, all other buffers are switched to this state.
+ 
+ When an index tuple is inserted, its initial processing can end in one of the
+ following points:
+ 1) Leaf page, if the depth of the index <= level_step, meaning that
+    none of the internal pages have buffers associated with them.
+ 2) Buffer of topmost level page that has buffers.
+ 
+ New index tuples are processed until one of the buffers in the topmost
+ buffered level becomes half-full. When a buffer becomes half-full, it's added
+ to the emptying queue, and will be emptied before a new tuple is processed.
+ 
+ Buffer emptying process means that index tuples from the buffer are moved
+ into buffers at a lower level, or leaf pages. First, all the other buffers are
+ swapped to disk to free up the memory. Then tuples are popped from the buffer
+ one by one, and cascaded down the tree to the next buffer or leaf page below
+ the buffered node.
+ 
+ Emptying a buffer has the interesting dynamic property that any intermediate
+ pages between the buffer being emptied, and the next buffered or leaf level
+ below it, become cached. If there are no more buffers below the node, the leaf
+ pages where the tuples finally land get cached too. If there are, the last
+ buffer page of each buffer below is kept in memory. This is illustrated in
+ the figures below:
+ 
+    Buffer being emptied to
+      lower-level buffers               Buffer being emptied to leaf pages
+ 
+                +(fb)                                 +(fb)
+             /     \                                /     \
+         +             +                        +             +
+       /   \         /   \                    /   \         /   \
+     *(ab)   *(ab) *(ab)   *(ab)            x       x     x       x
+ 
+ +    - cached internal page
+ x    - cached leaf page
+ *    - non-cached internal page
+ (fb) - buffer being emptied
+ (ab) - buffers being appended to, with last page in memory
+ 
+ In the beginning of the index build, the level-step is chosen so that all those
+ pages involved in emptying one buffer fit in cache, so after each of those
+ pages have been accessed once and cached, emptying a buffer doesn't involve
+ any more I/O. This locality is where the speedup of the buffering algorithm
+ comes from.
+ 
+ Emptying one buffer can fill up one or more of the lower-level buffers,
+ triggering emptying of them as well. Whenever a buffer becomes too full, it's
+ added to the emptying queue, and will be emptied after the current buffer has
+ been processed.
+ 
+ To keep the size of each buffer limited even in the worst case, buffer emptying
+ is scheduled as soon as a buffer becomes half-full, and emptying it continues
+ until 1/2 of the nominal buffer size worth of tuples has been emptied. This
+ guarantees that when buffer emptying begins, all the lower-level buffers
+ are at most half-full. In the worst case that all the tuples are cascaded down
+ to the same lower-level buffer, that buffer therefore has enough space to
+ accommodate all the tuples emptied from the upper-level buffer. There is no
+ hard size limit in any of the data structures used, though, so this only needs
+ to be approximate; small overfilling of some buffers doesn't matter.
+ 
+ If an internal page that has a buffer associated with it is split, the buffer
+ needs to be split too. All tuples in the buffer are scanned through and
+ relocated to the correct sibling buffers, using the penalty function to decide
+ which buffer each tuple should go to.
+ 
+ After all tuples from the heap have been processed, there are still some index
+ tuples in the buffers. At this point, final buffer emptying starts. All buffers
+ are emptied in top-down order. This is slightly complicated by the fact that
+ new buffers can be allocated during the emptying, due to page splits. However,
+ the new buffers will always be siblings of buffers that haven't been fully
+ emptied yet; tuples never move upwards in the tree. The final emptying loops
+ through buffers at a given level until all buffers at that level have been
+ emptied, and then moves down to the next level.
+ 
  
  Authors:
  	Teodor Sigaev	<teodor@sigaev.ru>
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 24,56 ****
  #include "utils/memutils.h"
  #include "utils/rel.h"
  
- /* Working state for gistbuild and its callback */
- typedef struct
- {
- 	GISTSTATE	giststate;
- 	int			numindexattrs;
- 	double		indtuples;
- 	MemoryContext tmpCtx;
- } GISTBuildState;
- 
- /* A List of these is used represent a split-in-progress. */
- typedef struct
- {
- 	Buffer		buf;			/* the split page "half" */
- 	IndexTuple	downlink;		/* downlink for this half. */
- } GISTPageSplitInfo;
- 
  /* non-export function prototypes */
- static void gistbuildCallback(Relation index,
- 				  HeapTuple htup,
- 				  Datum *values,
- 				  bool *isnull,
- 				  bool tupleIsAlive,
- 				  void *state);
- static void gistdoinsert(Relation r,
- 			 IndexTuple itup,
- 			 Size freespace,
- 			 GISTSTATE *GISTstate);
  static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 GISTSTATE *giststate,
--- 24,30 ----
***************
*** 89,226 **** createTempGistContext(void)
  }
  
  /*
-  * Routine to build an index.  Basically calls insert over and over.
-  *
-  * XXX: it would be nice to implement some sort of bulk-loading
-  * algorithm, but it is not clear how to do that.
-  */
- Datum
- gistbuild(PG_FUNCTION_ARGS)
- {
- 	Relation	heap = (Relation) PG_GETARG_POINTER(0);
- 	Relation	index = (Relation) PG_GETARG_POINTER(1);
- 	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
- 	IndexBuildResult *result;
- 	double		reltuples;
- 	GISTBuildState buildstate;
- 	Buffer		buffer;
- 	Page		page;
- 
- 	/*
- 	 * We expect to be called exactly once for any index relation. If that's
- 	 * not the case, big trouble's what we have.
- 	 */
- 	if (RelationGetNumberOfBlocks(index) != 0)
- 		elog(ERROR, "index \"%s\" already contains data",
- 			 RelationGetRelationName(index));
- 
- 	/* no locking is needed */
- 	initGISTstate(&buildstate.giststate, index);
- 
- 	/* initialize the root page */
- 	buffer = gistNewBuffer(index);
- 	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
- 	page = BufferGetPage(buffer);
- 
- 	START_CRIT_SECTION();
- 
- 	GISTInitBuffer(buffer, F_LEAF);
- 
- 	MarkBufferDirty(buffer);
- 
- 	if (RelationNeedsWAL(index))
- 	{
- 		XLogRecPtr	recptr;
- 		XLogRecData rdata;
- 
- 		rdata.data = (char *) &(index->rd_node);
- 		rdata.len = sizeof(RelFileNode);
- 		rdata.buffer = InvalidBuffer;
- 		rdata.next = NULL;
- 
- 		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
- 		PageSetLSN(page, recptr);
- 		PageSetTLI(page, ThisTimeLineID);
- 	}
- 	else
- 		PageSetLSN(page, GetXLogRecPtrForTemp());
- 
- 	UnlockReleaseBuffer(buffer);
- 
- 	END_CRIT_SECTION();
- 
- 	/* build the index */
- 	buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
- 	buildstate.indtuples = 0;
- 
- 	/*
- 	 * create a temporary memory context that is reset once for each tuple
- 	 * inserted into the index
- 	 */
- 	buildstate.tmpCtx = createTempGistContext();
- 
- 	/* do the heap scan */
- 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
- 								   gistbuildCallback, (void *) &buildstate);
- 
- 	/* okay, all heap tuples are indexed */
- 	MemoryContextDelete(buildstate.tmpCtx);
- 
- 	freeGISTstate(&buildstate.giststate);
- 
- 	/*
- 	 * Return statistics
- 	 */
- 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
- 
- 	result->heap_tuples = reltuples;
- 	result->index_tuples = buildstate.indtuples;
- 
- 	PG_RETURN_POINTER(result);
- }
- 
- /*
-  * Per-tuple callback from IndexBuildHeapScan
-  */
- static void
- gistbuildCallback(Relation index,
- 				  HeapTuple htup,
- 				  Datum *values,
- 				  bool *isnull,
- 				  bool tupleIsAlive,
- 				  void *state)
- {
- 	GISTBuildState *buildstate = (GISTBuildState *) state;
- 	IndexTuple	itup;
- 	MemoryContext oldCtx;
- 
- 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
- 
- 	/* form an index tuple and point it at the heap tuple */
- 	itup = gistFormTuple(&buildstate->giststate, index,
- 						 values, isnull, true /* size is currently bogus */ );
- 	itup->t_tid = htup->t_self;
- 
- 	/*
- 	 * Since we already have the index relation locked, we call gistdoinsert
- 	 * directly.  Normal access method calls dispatch through gistinsert,
- 	 * which locks the relation for write.	This is the right thing to do if
- 	 * you're inserting single tups, but not when you're initializing the
- 	 * whole index at once.
- 	 *
- 	 * In this path we respect the fillfactor setting, whereas insertions
- 	 * after initial build do not.
- 	 */
- 	gistdoinsert(index, itup,
- 			  RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
- 				 &buildstate->giststate);
- 
- 	buildstate->indtuples += 1;
- 	MemoryContextSwitchTo(oldCtx);
- 	MemoryContextReset(buildstate->tmpCtx);
- }
- 
- /*
   *	gistbuildempty() -- build an empty gist index in the initialization fork
   */
  Datum
--- 63,68 ----
***************
*** 293,300 **** gistinsert(PG_FUNCTION_ARGS)
   * In that case, we continue to hold the root page locked, and the child
   * pages are released; note that new tuple(s) are *not* on the root page
   * but in one of the new child pages.
   */
! static bool
  gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  				Buffer buffer,
  				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
--- 135,144 ----
   * In that case, we continue to hold the root page locked, and the child
   * pages are released; note that new tuple(s) are *not* on the root page
   * but in one of the new child pages.
+  *
+  * Returns 'true' if the page was split, 'false' otherwise.
   */
! bool
  gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  				Buffer buffer,
  				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
***************
*** 474,480 **** gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  			else
  				GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
  
! 			if (ptr->next && !is_rootsplit)
  				GistMarkFollowRight(ptr->page);
  			else
  				GistClearFollowRight(ptr->page);
--- 318,332 ----
  			else
  				GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
  
! 			/*
! 			 * Mark all but the right-most page with the follow-right
! 			 * flag. It will be cleared as soon as the downlink is inserted
! 			 * into the parent, but this ensures that if we error out before
! 			 * that, the index is still consistent. (in buffering build mode,
! 			 * any error will abort the index build anyway, so this is not
! 			 * needed.)
! 			 */
! 			if (ptr->next && !is_rootsplit && !giststate->gfbb)
  				GistMarkFollowRight(ptr->page);
  			else
  				GistClearFollowRight(ptr->page);
***************
*** 508,514 **** gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  		/* Write the WAL record */
  		if (RelationNeedsWAL(state->r))
  			recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
! 								   dist, oldrlink, oldnsn, leftchildbuf);
  		else
  			recptr = GetXLogRecPtrForTemp();
  
--- 360,367 ----
  		/* Write the WAL record */
  		if (RelationNeedsWAL(state->r))
  			recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
! 								   dist, oldrlink, oldnsn, leftchildbuf,
! 								   giststate->gfbb ? true : false);
  		else
  			recptr = GetXLogRecPtrForTemp();
  
***************
*** 570,577 **** gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  			recptr = GetXLogRecPtrForTemp();
  			PageSetLSN(page, recptr);
  		}
- 
- 		*splitinfo = NIL;
  	}
  
  	/*
--- 423,428 ----
***************
*** 608,614 **** gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
   * this routine assumes it is invoked in a short-lived memory context,
   * so it does not bother releasing palloc'd allocations.
   */
! static void
  gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
  {
  	ItemId		iid;
--- 459,465 ----
   * this routine assumes it is invoked in a short-lived memory context,
   * so it does not bother releasing palloc'd allocations.
   */
! void
  gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
  {
  	ItemId		iid;
***************
*** 1414,1419 **** initGISTstate(GISTSTATE *giststate, Relation index)
--- 1265,1271 ----
  		else
  			giststate->supportCollation[i] = DEFAULT_COLLATION_OID;
  	}
+ 	giststate->gfbb = NULL;
  }
  
  void
*** /dev/null
--- b/src/backend/access/gist/gistbuild.c
***************
*** 0 ****
--- 1,1066 ----
+ /*-------------------------------------------------------------------------
+  *
+  * gistbuild.c
+  *	  build algorithm for GiST indexes implementation.
+  *
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * IDENTIFICATION
+  *	  src/backend/access/gist/gistbuild.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/genam.h"
+ #include "access/gist_private.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_collation.h"
+ #include "miscadmin.h"
+ #include "optimizer/cost.h"
+ #include "storage/bufmgr.h"
+ #include "storage/indexfsm.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+ #include "utils/rel.h"
+ 
+ /* Step of index tuples for check whether to switch to buffering build mode */
+ #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
+ 
+ /*
+  * Number of tuples to process in the slow way before switching to buffering
+  * mode, when buffering is explicitly turned on. Also, the number of tuples
+  * to process between readjusting the buffer size parameter, while in
+  * buffering mode.
+  */
+ #define BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET 4096
+ 
+ typedef enum
+ {
+ 	GIST_BUFFERING_DISABLED,	/* in regular build mode and aren't going to
+ 								 * switch */
+ 	GIST_BUFFERING_AUTO,		/* in regular build mode, but will switch to
+ 								 * buffering build mode if the index grows
+ 								 * too big */
+ 	GIST_BUFFERING_STATS,		/* gathering statistics of index tuple size
+ 								 * before switching to the buffering build
+ 								 * mode */
+ 	GIST_BUFFERING_ACTIVE		/* in buffering build mode */
+ } GistBufferingMode;
+ 
+ /* Working state for gistbuild and its callback */
+ typedef struct
+ {
+ 	GISTSTATE	giststate;
+ 	int64		indtuples;
+ 	int64		indtuplesSize;
+ 
+ 	Size		freespace;	/* Amount of free space to leave on pages */
+ 
+ 	GistBufferingMode bufferingMode;
+ 	MemoryContext tmpCtx;
+ } GISTBuildState;
+ 
+ static void gistFreeUnreferencedPath(GISTBufferingInsertStack *path);
+ static bool gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+ 				GISTBuildBuffers *gfbb, IndexTuple itup,
+ 				GISTBufferingInsertStack *startparent);
+ static void gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state);
+ static void gistBufferingBuildInsert(Relation index, IndexTuple itup,
+ 						 GISTBuildState *buildstate);
+ static void gistBuildCallback(Relation index,
+ 				  HeapTuple htup,
+ 				  Datum *values,
+ 				  bool *isnull,
+ 				  bool tupleIsAlive,
+ 				  void *state);
+ static int	gistGetMaxLevel(Relation index);
+ static bool gistInitBuffering(GISTBuildState *buildstate, Relation index);
+ static int	calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+ 						int levelStep);
+ static void gistbufferinginserttuples(GISTInsertState *state, GISTSTATE *giststate,
+ 				Buffer buffer,
+ 				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+ 				GISTBufferingInsertStack *path);
+ static void gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r,
+ 							   GISTBufferingInsertStack *child);
+ 
+ /*
+  * Main entry point to GiST index build. Initially calls insert over and over,
+  * but switches to more efficient buffering build algorithm after a certain
+  * number of tuples (unless buffering mode is disabled).
+  */
+ Datum
+ gistbuild(PG_FUNCTION_ARGS)
+ {
+ 	Relation	heap = (Relation) PG_GETARG_POINTER(0);
+ 	Relation	index = (Relation) PG_GETARG_POINTER(1);
+ 	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ 	IndexBuildResult *result;
+ 	double		reltuples;
+ 	GISTBuildState buildstate;
+ 	Buffer		buffer;
+ 	Page		page;
+ 	MemoryContext oldcxt = CurrentMemoryContext;
+ 
+ 	buildstate.freespace = RelationGetTargetPageFreeSpace(index,
+ 													  GIST_DEFAULT_FILLFACTOR);
+ 
+ 	if (index->rd_options)
+ 	{
+ 		/* Get buffering mode from the options string */
+ 		GiSTOptions *options = (GiSTOptions *) index->rd_options;
+ 		char	   *bufferingMode = (char *) options + options->bufferingModeOffset;
+ 
+ 		if (strcmp(bufferingMode, "on") == 0)
+ 			buildstate.bufferingMode = GIST_BUFFERING_STATS;
+ 		else if (strcmp(bufferingMode, "off") == 0)
+ 			buildstate.bufferingMode = GIST_BUFFERING_DISABLED;
+ 		else
+ 			buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+ 	}
+ 	else
+ 	{
+ 		/* Automatic buffering mode switching by default */
+ 		buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+ 	}
+ 
+ 	/*
+ 	 * We expect to be called exactly once for any index relation. If that's
+ 	 * not the case, big trouble's what we have.
+ 	 */
+ 	if (RelationGetNumberOfBlocks(index) != 0)
+ 		elog(ERROR, "index \"%s\" already contains data",
+ 			 RelationGetRelationName(index));
+ 
+ 	/* no locking is needed */
+ 	initGISTstate(&buildstate.giststate, index);
+ 
+ 	/* initialize the root page */
+ 	buffer = gistNewBuffer(index);
+ 	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
+ 	page = BufferGetPage(buffer);
+ 
+ 	START_CRIT_SECTION();
+ 
+ 	GISTInitBuffer(buffer, F_LEAF);
+ 
+ 	MarkBufferDirty(buffer);
+ 
+ 	if (RelationNeedsWAL(index))
+ 	{
+ 		XLogRecPtr	recptr;
+ 		XLogRecData rdata;
+ 
+ 		rdata.data = (char *) &(index->rd_node);
+ 		rdata.len = sizeof(RelFileNode);
+ 		rdata.buffer = InvalidBuffer;
+ 		rdata.next = NULL;
+ 
+ 		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
+ 		PageSetLSN(page, recptr);
+ 		PageSetTLI(page, ThisTimeLineID);
+ 	}
+ 	else
+ 		PageSetLSN(page, GetXLogRecPtrForTemp());
+ 
+ 	UnlockReleaseBuffer(buffer);
+ 
+ 	END_CRIT_SECTION();
+ 
+ 	/* build the index */
+ 	buildstate.indtuples = 0;
+ 	buildstate.indtuplesSize = 0;
+ 
+ 	/*
+ 	 * create a temporary memory context that is reset once for each tuple
+ 	 * processed.
+ 	 */
+ 	buildstate.tmpCtx = createTempGistContext();
+ 
+ 	/*
+ 	 * Do the heap scan.
+ 	 */
+ 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
+ 								   gistBuildCallback, (void *) &buildstate);
+ 
+ 	/*
+ 	 * If buffering build was used, flush out all the tuples that are still
+ 	 * in the buffers.
+ 	 */
+ 	if (buildstate.bufferingMode == GIST_BUFFERING_ACTIVE)
+ 	{
+ 		int			i;
+ 		GISTInsertState insertstate;
+ 		GISTNodeBuffer *nodeBuffer;
+ 		MemoryContext oldCtx;
+ 		GISTBuildBuffers *gfbb = buildstate.giststate.gfbb;
+ 
+ 		elog(DEBUG1, "all tuples processed, emptying buffers");
+ 
+ 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+ 
+ 		memset(&insertstate, 0, sizeof(GISTInsertState));
+ 		insertstate.freespace = buildstate.freespace;
+ 		insertstate.r = index;
+ 
+ 		/*
+ 		 * Iterate through the levels, starting from the topmost.
+ 		 */
+ 		for (i = gfbb->buffersOnLevelsLen - 1; i >= 0; i--)
+ 		{
+ 			bool		nonEmpty = true;
+ 
+ 			/*
+ 			 * Empty all buffers on this level. We repeatedly loop through all
+ 			 * the buffers on this level, until we observe that all the
+ 			 * buffers are empty. Looping through the list once is not enough,
+ 			 * because emptying one buffer can cause pages to split and new
+ 			 * buffers to be created on the same (and lower) level.
+ 			 */
+ 			while (nonEmpty)
+ 			{
+ 				ListCell   *p;
+ 
+ 				nonEmpty = false;
+ 
+ 				for (p = list_head(gfbb->buffersOnLevels[i]); p; p = p->next)
+ 				{
+ 					bool		isRoot;
+ 
+ 					/* Get next node buffer */
+ 					nodeBuffer = (GISTNodeBuffer *) p->data.ptr_value;
+ 					isRoot = (nodeBuffer->nodeBlocknum == GIST_ROOT_BLKNO);
+ 
+ 					/* Skip empty node buffer */
+ 					if (nodeBuffer->blocksCount == 0)
+ 						continue;
+ 
+ 					/* Memorize that we saw a non-empty buffer. */
+ 					nonEmpty = true;
+ 
+ 					/* Process emptying of node buffer */
+ 					MemoryContextSwitchTo(gfbb->context);
+ 					gfbb->bufferEmptyingQueue = lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
+ 					MemoryContextSwitchTo(buildstate.tmpCtx);
+ 					gistProcessEmptyingStack(&buildstate.giststate, &insertstate);
+ 
+ 					/*
+ 					 * Root page node buffer is the only node buffer that can
+ 					 * be deleted from the list. So, let's be careful and
+ 					 * restart the scan.
+ 					 */
+ 					if (isRoot)
+ 						break;
+ 				}
+ 			}
+ 		}
+ 		MemoryContextSwitchTo(oldCtx);
+ 	}
+ 
+ 	/* okay, all heap tuples are indexed */
+ 	MemoryContextSwitchTo(oldcxt);
+ 	MemoryContextDelete(buildstate.tmpCtx);
+ 
+ 	freeGISTstate(&buildstate.giststate);
+ 
+ 	/*
+ 	 * Return statistics
+ 	 */
+ 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ 
+ 	result->heap_tuples = reltuples;
+ 	result->index_tuples = (double) buildstate.indtuples;
+ 
+ 	PG_RETURN_POINTER(result);
+ }
+ 
+ 
+ /*
+  * Validator for "buffering" reloption on GiST indexes. Allows "on", "off"
+  * and "auto" values.
+  */
+ void
+ gistValidateBufferingOption(char *value)
+ {
+ 	if (value == NULL ||
+ 		(strcmp(value, "on") != 0 &&
+ 		 strcmp(value, "off") != 0 &&
+ 		 strcmp(value, "auto") != 0))
+ 	{
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("invalid value for \"buffering\" option"),
+ 				 errdetail("Valid values are \"on\", \"off\" and \"auto\".")));
+ 	}
+ }
+ 
+ /*
+  * Free unreferenced parts of a path stack.
+  */
+ static void
+ gistFreeUnreferencedPath(GISTBufferingInsertStack *path)
+ {
+ 	while (path->refCount == 0)
+ 	{
+ 		/*
+ 		 * This path part is unreferenced. We can free it and decrease the
+ 		 * reference count of its parent. If the parent becomes unreferenced
+ 		 * too, the procedure is repeated for it.
+ 		 */
+ 		GISTBufferingInsertStack *tmp = path->parent;
+ 
+ 		pfree(path);
+ 		path = tmp;
+ 		if (path)
+ 			path->refCount--;
+ 		else
+ 			break;
+ 	}
+ }
+ 
+ /*
+  * Decrease reference count of path part, and free any unreferenced parts of
+  * the path stack.
+  */
+ void
+ gistDecreasePathRefcount(GISTBufferingInsertStack *path)
+ {
+ 	path->refCount--;
+ 	gistFreeUnreferencedPath(path);
+ }
+ 
+ /*
+  * Process an index tuple. Runs the tuple down the tree until we reach a leaf
+  * page or node buffer, and inserts the tuple there. Returns true if we have
+  * to stop buffer emptying process (because one of child buffers can't take
+  * index tuples anymore).
+  */
+ static bool
+ gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+ 				GISTBuildBuffers *gfbb, IndexTuple itup,
+ 				GISTBufferingInsertStack *startparent)
+ {
+ 	GISTBufferingInsertStack *path;
+ 	BlockNumber childblkno;
+ 	Buffer		buffer;
+ 	bool		result = false;
+ 
+ 	/*
+ 	 * NULL passed in startparent means that we start index tuple processing
+ 	 * from the root.
+ 	 */
+ 	if (!startparent)
+ 		path = gfbb->rootitem;
+ 	else
+ 		path = startparent;
+ 
+ 	/*
+ 	 * Loop until we reach a leaf page (level == 0) or a level with buffers
+ 	 * (not including the level we start at, because we would otherwise make
+ 	 * no progress).
+ 	 */
+ 	for (;;)
+ 	{
+ 		ItemId		iid;
+ 		IndexTuple	idxtuple,
+ 					newtup;
+ 		Page		page;
+ 		OffsetNumber childoffnum;
+ 		GISTBufferingInsertStack *parent;
+ 
+ 		/* Have we reached a level with buffers? */
+ 		if (LEVEL_HAS_BUFFERS(path->level, gfbb) && path != startparent)
+ 			break;
+ 
+ 		/* Have we reached a leaf page? */
+ 		if (path->level == 0)
+ 			break;
+ 
+ 		/*
+ 		 * Nope. Descend down to the next level then. Choose a child to descend
+ 		 * down to.
+ 		 */
+ 		buffer = ReadBuffer(state->r, path->blkno);
+ 		LockBuffer(buffer, GIST_EXCLUSIVE);
+ 
+ 		page = (Page) BufferGetPage(buffer);
+ 		childoffnum = gistchoose(state->r, page, itup, giststate);
+ 		iid = PageGetItemId(page, childoffnum);
+ 		idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 		childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+ 
+ 		/*
+ 		 * Check that the key representing the target child node is
+ 		 * consistent with the key we're inserting. Update it if it's not.
+ 		 */
+ 		newtup = gistgetadjusted(state->r, idxtuple, itup, giststate);
+ 		if (newtup)
+ 			gistbufferinginserttuples(state, giststate, buffer, &newtup, 1,
+ 									  childoffnum, path);
+ 		UnlockReleaseBuffer(buffer);
+ 
+ 		/* Create new path item representing current page */
+ 		parent = path;
+ 		path = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+ 										   sizeof(GISTBufferingInsertStack));
+ 		path->parent = parent;
+ 		path->level = parent->level - 1;
+ 		path->blkno = childblkno;
+ 		path->downlinkoffnum = childoffnum;
+ 		path->refCount = 0;		/* it's unreferenced for now */
+ 
+ 		/* Adjust reference count of parent */
+ 		if (parent)
+ 			parent->refCount++;
+ 	}
+ 
+ 	if (LEVEL_HAS_BUFFERS(path->level, gfbb))
+ 	{
+ 		/*
+ 		 * We've reached a level with buffers. Place the index tuple in the
+ 		 * buffer, and add the buffer to the emptying queue if it overflows.
+ 		 */
+ 		GISTNodeBuffer *childNodeBuffer;
+ 
+ 		/* Find the buffer or create a new one */
+ 		childNodeBuffer = gistGetNodeBuffer(gfbb, giststate, path->blkno,
+ 											path->downlinkoffnum, path->parent);
+ 
+ 		/* Add index tuple to it */
+ 		gistPushItupToNodeBuffer(gfbb, childNodeBuffer, itup);
+ 
+ 		if (BUFFER_OVERFLOWED(childNodeBuffer, gfbb))
+ 			result = true;
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * We've reached a leaf page. Place the tuple here.
+ 		 */
+ 		buffer = ReadBuffer(state->r, path->blkno);
+ 		LockBuffer(buffer, GIST_EXCLUSIVE);
+ 		gistbufferinginserttuples(state, giststate, buffer, &itup, 1,
+ 								  InvalidOffsetNumber, path);
+ 		UnlockReleaseBuffer(buffer);
+ 	}
+ 
+ 	/*
+ 	 * Free unreferenced path items, if any. Path item may be referenced by
+ 	 * node buffer.
+ 	 */
+ 	gistFreeUnreferencedPath(path);
+ 
+ 	return result;
+ }
+ 
+ /*
+  * Insert tuples to a given page.
+  *
+  * This is analogous to gistinserttuples() in the regular insertion code.
+  */
+ static void
+ gistbufferinginserttuples(GISTInsertState *state, GISTSTATE *giststate,
+ 						  Buffer buffer,
+ 						  IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+ 						  GISTBufferingInsertStack *path)
+ {
+ 	GISTBuildBuffers *gfbb = giststate->gfbb;
+ 	List	   *splitinfo;
+ 	bool		is_split;
+ 
+ 	is_split = gistplacetopage(state, giststate, buffer,
+ 							   itup, ntup, oldoffnum,
+ 							   InvalidBuffer,
+ 							   &splitinfo);
+ 	/*
+ 	 * If this is a root split, update the root path item kept in memory.
+ 	 * This ensures that all path stacks are always complete, including all
+ 	 * parent nodes up to the root. That simplifies the algorithm to re-find
+ 	 * correct parent.
+ 	 */
+ 	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
+ 	{
+ 		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
+ 		Page		page = BufferGetPage(buffer);
+ 		ItemId		iid;
+ 		IndexTuple	idxtuple;
+ 		BlockNumber leftmostchild;
+ 
+ 		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+ 			gfbb->context, sizeof(GISTBufferingInsertStack));
+ 		gfbb->rootitem->parent = NULL;
+ 		gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+ 		gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+ 		gfbb->rootitem->level = oldroot->level + 1;
+ 		gfbb->rootitem->refCount = 1;
+ 
+ 		/*
+ 		 * All the downlinks on the old root page are now on one of the child
+ 		 * pages. Change the block number of the old root entry in the stack
+ 		 * to point to the leftmost child. The other child pages will be
+ 		 * accessible from there by walking right.
+ 		 */
+ 		iid = PageGetItemId(page, FirstOffsetNumber);
+ 		idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 		leftmostchild = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+ 
+ 		oldroot->parent = gfbb->rootitem;
+ 		oldroot->blkno = leftmostchild;
+ 		oldroot->downlinkoffnum = InvalidOffsetNumber;
+ 	}
+ 
+ 	if (splitinfo)
+ 	{
+ 		/*
+ 		 * Insert the downlinks to the parent. This is analogous with
+ 		 * gistfinishsplit() in the regular insertion code, but the locking
+ 		 * is simpler, and we have to maintain the buffers.
+ 		 */
+ 		IndexTuple *downlinks;
+ 		int			ndownlinks,
+ 					i;
+ 		Buffer		parentBuffer;
+ 		ListCell   *lc;
+ 
+ 		/* Parent may have changed since we memorized this path. */
+ 		gistBufferingFindCorrectParent(giststate, state->r, path);
+ 
+ 		/*
+ 		 * If there's a buffer associated with this page, that needs to
+ 		 * be split too. gistRelocateBuildBuffersOnSplit() will also adjust
+ 		 * the downlinks in 'splitinfo', to make sure they're consistent not
+ 		 * only with the tuples already on the pages, but also the tuples in
+ 		 * the buffers that will eventually be inserted to them.
+ 		 */
+ 		gistRelocateBuildBuffersOnSplit(gfbb, giststate, state->r,
+ 										path, buffer, splitinfo);
+ 
+ 		/* Create an array of all the downlink tuples */
+ 		ndownlinks = list_length(splitinfo);
+ 		downlinks = (IndexTuple *) palloc(sizeof(IndexTuple) * ndownlinks);
+ 		i = 0;
+ 		foreach(lc, splitinfo)
+ 		{
+ 			GISTPageSplitInfo *splitinfo = lfirst(lc);
+ 
+ 			/*
+ 			 * Since there's no concurrent access, we can release the lower
+ 			 * level buffers immediately. Don't release the buffer for the
+ 			 * original page, though, because the caller will release that.
+ 			 */
+ 			if (splitinfo->buf != buffer)
+ 				UnlockReleaseBuffer(splitinfo->buf);
+ 			downlinks[i++] = splitinfo->downlink;
+ 		}
+ 
+ 		/* Insert them into parent. */
+ 		parentBuffer = ReadBuffer(state->r, path->parent->blkno);
+ 		LockBuffer(parentBuffer, GIST_EXCLUSIVE);
+ 		gistbufferinginserttuples(state, giststate, parentBuffer,
+ 								  downlinks, ndownlinks,
+ 								  path->downlinkoffnum, path->parent);
+ 		UnlockReleaseBuffer(parentBuffer);
+ 
+ 		list_free_deep(splitinfo);		/* we don't need this anymore */
+ 	}
+ }
+ 
+ /*
+  * Find the correct parent by following rightlinks in a buffering index
+  * build. This method of searching for the parent works because no concurrent
+  * activity is possible during an index build.
+  */
+ static void
+ gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r,
+ 							   GISTBufferingInsertStack *child)
+ {
+ 	GISTBuildBuffers *gfbb = giststate->gfbb;
+ 	GISTBufferingInsertStack *parent = child->parent;
+ 	OffsetNumber i,
+ 				maxoff;
+ 	ItemId		iid;
+ 	IndexTuple	idxtuple;
+ 	Buffer		buffer;
+ 	Page		page;
+ 	bool		copied = false;
+ 
+ 	buffer = ReadBuffer(r, parent->blkno);
+ 	page = BufferGetPage(buffer);
+ 	LockBuffer(buffer, GIST_EXCLUSIVE);
+ 	gistcheckpage(r, buffer);
+ 
+ 	/* Check if it was not moved */
+ 	if (child->downlinkoffnum != InvalidOffsetNumber)
+ 	{
+ 		iid = PageGetItemId(page, child->downlinkoffnum);
+ 		idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+ 		{
+ 			/* Still there */
+ 			UnlockReleaseBuffer(buffer);
+ 			return;
+ 		}
+ 	}
+ 
+ 	/* Parent has changed; follow rightlinks until the downlink is found */
+ 	while (true)
+ 	{
+ 		/* Search for relevant downlink in the current page */
+ 		maxoff = PageGetMaxOffsetNumber(page);
+ 		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+ 		{
+ 			iid = PageGetItemId(page, i);
+ 			idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 			if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+ 			{
+ 				/* Found it */
+ 				child->downlinkoffnum = i;
+ 				UnlockReleaseBuffer(buffer);
+ 				return;
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * We should copy parent path item because some other path items can
+ 		 * refer to it.
+ 		 */
+ 		if (!copied)
+ 		{
+ 			parent = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+ 										   sizeof(GISTBufferingInsertStack));
+ 			memcpy(parent, child->parent, sizeof(GISTBufferingInsertStack));
+ 			if (parent->parent)
+ 				parent->parent->refCount++;
+ 			gistDecreasePathRefcount(child->parent);
+ 			child->parent = parent;
+ 			parent->refCount = 1;
+ 			copied = true;
+ 		}
+ 
+ 		/*
+ 		 * Not found in current page. Move towards rightlink.
+ 		 */
+ 		parent->blkno = GistPageGetOpaque(page)->rightlink;
+ 		UnlockReleaseBuffer(buffer);
+ 
+ 		if (parent->blkno == InvalidBlockNumber)
+ 		{
+ 			/*
+ 			 * End of chain and still didn't find parent. Should not happen
+ 			 * during index build.
+ 			 */
+ 			break;
+ 		}
+ 
+ 		/* Get the next page */
+ 		buffer = ReadBuffer(r, parent->blkno);
+ 		page = BufferGetPage(buffer);
+ 		LockBuffer(buffer, GIST_EXCLUSIVE);
+ 		gistcheckpage(r, buffer);
+ 	}
+ 
+ 	elog(ERROR, "failed to re-find parent for block %u", child->blkno);
+ }
+ 
+ /*
+  * Process the buffer emptying stack. Emptying one buffer can cause other
+  * buffers to be emptied. This function iterates until this cascading
+  * emptying process has finished, i.e. until the emptying stack is empty.
+  */
+ static void
+ gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state)
+ {
+ 	GISTBuildBuffers *gfbb = giststate->gfbb;
+ 
+ 	/* Iterate while we have elements in buffers emptying stack. */
+ 	while (gfbb->bufferEmptyingQueue != NIL)
+ 	{
+ 		GISTNodeBuffer *emptyingNodeBuffer;
+ 
+ 		/* Get node buffer from emptying stack. */
+ 		emptyingNodeBuffer = (GISTNodeBuffer *) linitial(gfbb->bufferEmptyingQueue);
+ 		gfbb->bufferEmptyingQueue = list_delete_first(gfbb->bufferEmptyingQueue);
+ 		emptyingNodeBuffer->queuedForEmptying = false;
+ 
+ 		/*
+ 		 * We are going to load the last pages of the buffers that this buffer
+ 		 * is emptied into. So unload any previously loaded buffers first.
+ 		 */
+ 		gistUnloadNodeBuffers(gfbb);
+ 
+ 		/* Variables for detecting a split of the buffer being emptied. */
+ 		gfbb->currentEmptyingBufferSplit = false;
+ 		gfbb->currentEmptyingBufferBlockNumber = emptyingNodeBuffer->nodeBlocknum;
+ 
+ 		while (true)
+ 		{
+ 			IndexTuple	itup;
+ 
+ 			/* Get next index tuple from the buffer */
+ 			if (!gistPopItupFromNodeBuffer(gfbb, emptyingNodeBuffer, &itup))
+ 				break;
+ 
+ 			/* Run it down to the underlying node buffer or leaf page */
+ 			if (gistProcessItup(giststate, state, gfbb, itup, emptyingNodeBuffer->path))
+ 				break;
+ 
+ 			/* Free all the memory allocated during index tuple processing */
+ 			MemoryContextReset(CurrentMemoryContext);
+ 
+ 			/*
+ 			 * If current emptying node buffer split, we have to stop emptying
+ 			 * it, because the buffer might not exist anymore.
+ 			 */
+ 			if (gfbb->currentEmptyingBufferSplit)
+ 				break;
+ 		}
+ 	}
+ }
+ 
+ /*
+  * Insert function for buffering index build.
+  */
+ static void
+ gistBufferingBuildInsert(Relation index, IndexTuple itup,
+ 						 GISTBuildState *buildstate)
+ {
+ 	GISTBuildBuffers *gfbb = buildstate->giststate.gfbb;
+ 	GISTInsertState insertstate;
+ 
+ 	memset(&insertstate, 0, sizeof(GISTInsertState));
+ 	insertstate.freespace = buildstate->freespace;
+ 	insertstate.r = index;
+ 
+ 	/* We are ready for index tuple processing */
+ 	gistProcessItup(&buildstate->giststate, &insertstate, gfbb, itup, NULL);
+ 
+ 	/* Process buffer emptying stack if any */
+ 	gistProcessEmptyingStack(&buildstate->giststate, &insertstate);
+ }
+ 
+ /*
+  * Per-tuple callback from IndexBuildHeapScan.
+  */
+ static void
+ gistBuildCallback(Relation index,
+ 				  HeapTuple htup,
+ 				  Datum *values,
+ 				  bool *isnull,
+ 				  bool tupleIsAlive,
+ 				  void *state)
+ {
+ 	GISTBuildState *buildstate = (GISTBuildState *) state;
+ 	IndexTuple	itup;
+ 	MemoryContext oldCtx;
+ 
+ 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+ 
+ 	/* form an index tuple and point it at the heap tuple */
+ 	itup = gistFormTuple(&buildstate->giststate, index, values, isnull, true);
+ 	itup->t_tid = htup->t_self;
+ 
+ 	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE)
+ 	{
+ 		/* We have buffers, so use them. */
+ 		gistBufferingBuildInsert(index, itup, buildstate);
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * There are no buffers (yet). Since we already have the index relation
+ 		 * locked, we call gistdoinsert directly.
+ 		 *
+ 		 * In this path we respect the fillfactor setting, whereas insertions
+ 		 * after initial build do not.
+ 		 */
+ 		gistdoinsert(index, itup, buildstate->freespace,
+ 					 &buildstate->giststate);
+ 	}
+ 
+ 	/* Increase statistics of index tuples count and their total size. */
+ 	buildstate->indtuples += 1;
+ 	buildstate->indtuplesSize += IndexTupleSize(itup);
+ 
+ 	MemoryContextSwitchTo(oldCtx);
+ 	MemoryContextReset(buildstate->tmpCtx);
+ 
+ 	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE &&
+ 		buildstate->indtuples % BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET == 0)
+ 	{
+ 		/* Adjust the target buffer size now */
+ 		buildstate->giststate.gfbb->pagesPerBuffer =
+ 			calculatePagesPerBuffer(buildstate, index,
+ 									buildstate->giststate.gfbb->levelStep);
+ 	}
+ 
+ 	/*
+ 	 * In 'auto' mode, check if the index has grown too large to fit in
+ 	 * cache, and switch to buffering mode if it has.
+ 	 *
+ 	 * To avoid excessive calls to smgrnblocks(), only check this every
+ 	 * BUFFERING_MODE_SWITCH_CHECK_STEP index tuples.
+ 	 */
+ 	if ((buildstate->bufferingMode == GIST_BUFFERING_AUTO &&
+ 		 buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
+ 		 effective_cache_size < smgrnblocks(index->rd_smgr, MAIN_FORKNUM)) ||
+ 		(buildstate->bufferingMode == GIST_BUFFERING_STATS &&
+ 		 buildstate->indtuples >= BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET))
+ 	{
+ 		/*
+ 		 * Index doesn't fit in effective cache anymore. Try to switch to
+ 		 * buffering build mode.
+ 		 */
+ 		if (gistInitBuffering(buildstate, index))
+ 		{
+ 			/*
+ 			 * Buffering build is successfully initialized. Now we can set
+ 			 * appropriate flag.
+ 			 */
+ 			buildstate->bufferingMode = GIST_BUFFERING_ACTIVE;
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Failed to switch to buffering build because the memory settings
+ 			 * are too low. Mark that we aren't going to switch anymore.
+ 			 */
+ 			buildstate->bufferingMode = GIST_BUFFERING_DISABLED;
+ 		}
+ 	}
+ }
+ 
+ /*
+  * Calculate pagesPerBuffer parameter for the buffering algorithm.
+  *
+  * Buffer size is chosen so that assuming that tuples are distributed
+  * randomly, emptying half a buffer fills on average one page in every buffer
+  * at the next lower level.
+  */
+ static int
+ calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+ 						int levelStep)
+ {
+ 	double		pagesPerBuffer;
+ 	double		avgIndexTuplesPerPage;
+ 	double		itupAvgSize;
+ 	Size		pageFreeSpace;
+ 
+ 	/* Calc space of index page which is available for index tuples */
+ 	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+ 		- sizeof(ItemIdData)
+ 		- buildstate->freespace;
+ 
+ 	/*
+ 	 * Calculate average size of already inserted index tuples using
+ 	 * gathered statistics.
+ 	 */
+ 	itupAvgSize = (double) buildstate->indtuplesSize /
+ 				  (double) buildstate->indtuples;
+ 
+ 	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+ 
+ 	/*
+ 	 * Recalculate required size of buffers.
+ 	 */
+ 	pagesPerBuffer = 2 * pow(avgIndexTuplesPerPage, levelStep);
+ 
+ 	return round(pagesPerBuffer);
+ }
+ 
+ 
+ /*
+  * Get the depth of the GiST index.
+  */
+ static int
+ gistGetMaxLevel(Relation index)
+ {
+ 	int			maxLevel;
+ 	BlockNumber blkno;
+ 
+ 	/*
+ 	 * Traverse down the tree, starting from the root, until we hit the
+ 	 * leaf level.
+ 	 */
+ 	maxLevel = 0;
+ 	blkno = GIST_ROOT_BLKNO;
+ 	while (true)
+ 	{
+ 		Buffer		buffer;
+ 		Page		page;
+ 		IndexTuple	itup;
+ 
+ 		buffer = ReadBuffer(index, blkno);
+ 		page = (Page) BufferGetPage(buffer);
+ 
+ 		if (GistPageIsLeaf(page))
+ 		{
+ 			/* We hit the bottom, so we're done. */
+ 			ReleaseBuffer(buffer);
+ 			break;
+ 		}
+ 
+ 		/*
+ 		 * Pick the first downlink on the page, and follow it. It doesn't
+ 		 * matter which downlink we choose, the tree has the same depth
+ 		 * everywhere, so we just pick the first one.
+ 		 */
+ 		itup = (IndexTuple) PageGetItem(page,
+ 									 PageGetItemId(page, FirstOffsetNumber));
+ 		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ 		ReleaseBuffer(buffer);
+ 
+ 		/*
+ 		 * We're going down the tree. That means there is one more level in
+ 		 * the tree.
+ 		 */
+ 		maxLevel++;
+ 	}
+ 	return maxLevel;
+ }
+ 
+ /*
+  * Initial calculations for GiST buffering build.
+  */
+ static bool
+ gistInitBuffering(GISTBuildState *buildstate, Relation index)
+ {
+ 	int			pagesPerBuffer;
+ 	Size		pageFreeSpace;
+ 	Size		itupAvgSize,
+ 				itupMinSize;
+ 	double		avgIndexTuplesPerPage,
+ 				maxIndexTuplesPerPage;
+ 	int			i;
+ 	int			levelStep;
+ 	GISTBuildBuffers *gfbb;
+ 
+ 	/* Calc space of index page which is available for index tuples */
+ 	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+ 		- sizeof(ItemIdData)
+ 		- buildstate->freespace;
+ 
+ 	/*
+ 	 * Calculate average size of already inserted index tuples using gathered
+ 	 * statistics.
+ 	 */
+ 	itupAvgSize = (double) buildstate->indtuplesSize /
+ 				  (double) buildstate->indtuples;
+ 
+ 	/*
+ 	 * Calculate minimal possible size of index tuple by index metadata.
+ 	 * Minimal possible size of varlena is VARHDRSZ.
+ 	 *
+ 	 * XXX: that's not actually true, as a short varlen can be just 2 bytes.
+ 	 * And we should take padding into account here.
+ 	 */
+ 	itupMinSize = (Size) MAXALIGN(sizeof(IndexTupleData));
+ 	for (i = 0; i < index->rd_att->natts; i++)
+ 	{
+ 		if (index->rd_att->attrs[i]->attlen < 0)
+ 			itupMinSize += VARHDRSZ;
+ 		else
+ 			itupMinSize += index->rd_att->attrs[i]->attlen;
+ 	}
+ 
+ 	/* Calculate average and maximal number of index tuples that fit on a page */
+ 	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+ 	maxIndexTuplesPerPage = pageFreeSpace / itupMinSize;
+ 
+ 	/*
+ 	 * We need to calculate two parameters for the buffering algorithm:
+ 	 * levelStep and pagesPerBuffer.
+ 	 *
+ 	 * levelStep determines the size of subtree that we operate on, while
+ 	 * emptying a buffer. A higher value is better, as you need fewer buffer
+ 	 * emptying steps to perform the index build. However, if you set it too
+ 	 * high, the subtree doesn't fit in cache anymore, and you quickly lose
+ 	 * the benefit of the buffers.
+ 	 *
+ 	 * In Arge et al's paper, levelStep is chosen as logB(M/4B), where B is
+ 	 * the number of tuples on page (ie. fanout), and M is the amount of
+ 	 * internal memory available. Curiously, they don't explain *why* that
+ 	 * setting is optimal. We calculate it by taking the highest levelStep
+ 	 * so that a subtree still fits in cache. For a small B, our way of
+ 	 * calculating levelStep is very close to Arge et al's formula. For a
+ 	 * large B, our formula gives a value that is 2x higher.
+ 	 *
+ 	 * The average size of a subtree of depth n can be calculated as a
+ 	 * geometric series:
+ 	 *
+ 	 *		B^0 + B^1 + B^2 + ... + B^n = (1 - B^(n + 1)) / (1 - B)
+ 	 *
+ 	 * where B is the average number of index tuples on page. The subtree is
+ 	 * cached in the shared buffer cache and the OS cache, so we choose
+ 	 * levelStep so that the subtree size is comfortably smaller than
+ 	 * effective_cache_size, with a safety factor of 4.
+ 	 *
+ 	 * The estimate on the average number of index tuples on page is based on
+ 	 * average tuple sizes observed before switching to buffered build, so the
+ 	 * real subtree size can be somewhat larger. Also, it would be selfish to
+ 	 * gobble the whole cache for our index build. The safety factor of 4
+ 	 * should account for those effects.
+ 	 *
+ 	 * The other limiting factor for setting levelStep is that while
+ 	 * processing a subtree, we need to hold one page for each buffer at the
+ 	 * next lower buffered level. The max. number of buffers needed for that
+ 	 * is maxIndexTuplesPerPage^levelStep. This is very conservative, but
+ 	 * hopefully maintenance_work_mem is set high enough that you're
+ 	 * constrained by effective_cache_size rather than maintenance_work_mem.
+ 	 *
+ 	 * XXX: the buffer hash table consumes a fair amount of memory too per
+ 	 * buffer, but that is not currently taken into account. That scales on
+ 	 * the total number of buffers used, ie. the index size and on levelStep.
+ 	 * Note that a higher levelStep *reduces* the amount of memory needed for
+ 	 * the hash table.
+ 	 */
+ 	levelStep = 1;
+ 	while (
+ 		/* subtree must fit in cache (with safety factor of 4) */
+ 		(1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size / 4
+ 		&&
+ 		/* each node in the lowest level of a subtree has one page in memory */
+ 		(pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ)
+ 		)
+ 	{
+ 		levelStep++;
+ 	}
+ 
+ 	/*
+ 	 * We've just reached an unacceptable value of levelStep in the loop above.
+ 	 * So, decrease levelStep to get the last acceptable value.
+ 	 */
+ 	levelStep--;
+ 
+ 	/*
+ 	 * If there's not enough cache or maintenance_work_mem, fall back to plain
+ 	 * inserts.
+ 	 */
+ 	if (levelStep <= 0)
+ 	{
+ 		elog(DEBUG1, "failed to switch to buffered GiST build");
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * The second parameter to set is pagesPerBuffer, which determines the
+ 	 * size of each buffer. We adjust pagesPerBuffer also during the build,
+ 	 * which is why this calculation is in a separate function.
+ 	 */
+ 	pagesPerBuffer = calculatePagesPerBuffer(buildstate, index, levelStep);
+ 
+ 	elog(DEBUG1, "switching to buffered GiST build; level step = %d, pagesPerBuffer = %d",
+ 		 levelStep, pagesPerBuffer);
+ 
+ 	/* Initialize GISTBuildBuffers with these parameters */
+ 	gfbb = palloc(sizeof(GISTBuildBuffers));
+ 	gfbb->pagesPerBuffer = pagesPerBuffer;
+ 	gfbb->levelStep = levelStep;
+ 	gistInitBuildBuffers(gfbb, gistGetMaxLevel(index));
+ 
+ 	buildstate->giststate.gfbb = gfbb;
+ 
+ 	return true;
+ }
*** /dev/null
--- b/src/backend/access/gist/gistbuildbuffers.c
***************
*** 0 ****
--- 1,795 ----
+ /*-------------------------------------------------------------------------
+  *
+  * gistbuildbuffers.c
+  *	  node buffer management functions for GiST buffering build algorithm.
+  *
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * IDENTIFICATION
+  *	  src/backend/access/gist/gistbuildbuffers.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/genam.h"
+ #include "access/gist_private.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_collation.h"
+ #include "miscadmin.h"
+ #include "storage/buffile.h"
+ #include "storage/bufmgr.h"
+ #include "storage/indexfsm.h"
+ #include "utils/memutils.h"
+ #include "utils/rel.h"
+ 
+ static GISTNodeBufferPage *gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb);
+ static void gistAddLoadedBuffer(GISTBuildBuffers *gfbb, BlockNumber blocknum);
+ static void gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+ static void gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+ static void gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple item);
+ static void gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *item);
+ static int	gistBuffersFreeBlocksCmp(const void *a, const void *b);
+ static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
+ static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+ 
+ /*
+  * Initialize GiST buffering build data structure.
+  */
+ void
+ gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel)
+ {
+ 	HASHCTL		hashCtl;
+ 
+ 	/*
+ 	 * Create a temporary file to hold buffer pages that are swapped out
+ 	 * of memory. Initialize data structures for free pages management.
+ 	 */
+ 	gfbb->pfile = BufFileCreateTemp(true);
+ 	gfbb->nFileBlocks = 0;
+ 	gfbb->nFreeBlocks = 0;
+ 	gfbb->blocksSorted = false;
+ 	gfbb->freeBlocksLen = 32;
+ 	gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
+ 
+ 	/*
+ 	 * Current memory context will be used for all in-memory data structures
+ 	 * of buffers which are persistent during buffering build.
+ 	 */
+ 	gfbb->context = CurrentMemoryContext;
+ 
+ 	/*
+ 	 * The nodeBuffersTab hash table maps index blocks to their node
+ 	 * buffers.
+ 	 */
+ 	hashCtl.keysize = sizeof(BlockNumber);
+ 	hashCtl.entrysize = sizeof(GISTNodeBuffer);
+ 	hashCtl.hcxt = CurrentMemoryContext;
+ 	hashCtl.hash = tag_hash;
+ 	hashCtl.match = memcmp;
+ 	gfbb->nodeBuffersTab = hash_create("gistbuildbuffers",
+ 									   1024,
+ 									   &hashCtl,
+ 									   HASH_ELEM | HASH_CONTEXT
+ 									   | HASH_FUNCTION | HASH_COMPARE);
+ 
+ 	gfbb->bufferEmptyingQueue = NIL;
+ 
+ 	gfbb->currentEmptyingBufferBlockNumber = InvalidBlockNumber;
+ 	gfbb->currentEmptyingBufferSplit = false;
+ 
+ 	/*
+ 	 * Per-level node buffers lists for final buffers emptying process. Node
+ 	 * buffers are inserted here when they are created.
+ 	 */
+ 	gfbb->buffersOnLevelsLen = 1;
+ 	gfbb->buffersOnLevels = (List **) palloc(sizeof(List *) *
+ 											 gfbb->buffersOnLevelsLen);
+ 	gfbb->buffersOnLevels[0] = NIL;
+ 
+ 	/*
+ 	 * Block numbers of node buffers whose last pages are currently loaded
+ 	 * into main memory.
+ 	 */
+ 	gfbb->loadedBuffersLen = 32;
+ 	gfbb->loadedBuffers = (BlockNumber *) palloc(gfbb->loadedBuffersLen *
+ 												 sizeof(BlockNumber));
+ 	gfbb->loadedBuffersCount = 0;
+ 
+ 	/*
+ 	 * Root path item of the tree. Updated on each root node split.
+ 	 */
+ 	gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+ 							gfbb->context, sizeof(GISTBufferingInsertStack));
+ 	gfbb->rootitem->parent = NULL;
+ 	gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+ 	gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+ 	gfbb->rootitem->level = maxLevel;
+ 	gfbb->rootitem->refCount = 1;
+ }
+ 
+ /*
+  * Returns a node buffer for given block. The buffer is created if it
+  * doesn't exist yet.
+  */
+ GISTNodeBuffer *
+ gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+ 				  BlockNumber nodeBlocknum,
+ 				  OffsetNumber downlinkoffnum,
+ 				  GISTBufferingInsertStack *parent)
+ {
+ 	GISTNodeBuffer *nodeBuffer;
+ 	bool		found;
+ 
+ 	/* Find node buffer in hash table */
+ 	nodeBuffer = (GISTNodeBuffer *) hash_search(gfbb->nodeBuffersTab,
+ 												(const void *) &nodeBlocknum,
+ 												HASH_ENTER,
+ 												&found);
+ 	if (!found)
+ 	{
+ 		/*
+ 		 * Node buffer wasn't found. Initialize the new buffer as empty.
+ 		 */
+ 		GISTBufferingInsertStack *path;
+ 		int			level;
+ 		MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+ 
+ 		nodeBuffer->pageBuffer = NULL;
+ 		nodeBuffer->blocksCount = 0;
+ 		nodeBuffer->queuedForEmptying = false;
+ 
+ 		/*
+ 		 * Create a path stack for the page.
+ 		 */
+ 		if (nodeBlocknum != GIST_ROOT_BLKNO)
+ 		{
+ 			path = (GISTBufferingInsertStack *) palloc(
+ 										   sizeof(GISTBufferingInsertStack));
+ 			path->parent = parent;
+ 			path->blkno = nodeBlocknum;
+ 			path->downlinkoffnum = downlinkoffnum;
+ 			path->level = parent->level - 1;
+ 			path->refCount = 0;		/* initially unreferenced */
+ 			parent->refCount++;		/* this path references its parent */
+ 			Assert(path->level > 0);
+ 		}
+ 		else
+ 			path = gfbb->rootitem;
+ 
+ 		nodeBuffer->path = path;
+ 		path->refCount++;
+ 
+ 		/*
+ 		 * Add this buffer to the list of buffers on this level. Enlarge
+ 		 * buffersOnLevels array if needed.
+ 		 */
+ 		level = path->level;
+ 		if (level >= gfbb->buffersOnLevelsLen)
+ 		{
+ 			int			i;
+ 
+ 			gfbb->buffersOnLevels =
+ 				(List **) repalloc(gfbb->buffersOnLevels,
+ 								   (level + 1) * sizeof(List *));
+ 
+ 			/* initialize the enlarged portion */
+ 			for (i = gfbb->buffersOnLevelsLen; i <= level; i++)
+ 				gfbb->buffersOnLevels[i] = NIL;
+ 			gfbb->buffersOnLevelsLen = level + 1;
+ 		}
+ 
+ 		gfbb->buffersOnLevels[level] = lcons(nodeBuffer,
+ 											 gfbb->buffersOnLevels[level]);
+ 
+ 		MemoryContextSwitchTo(oldcxt);
+ 	}
+ 	else
+ 	{
+ 		if (parent != nodeBuffer->path->parent)
+ 		{
+ 			/*
+ 			 * A different parent path item was provided than the one we
+ 			 * remembered. We trust the caller to provide a more up-to-date
+ 			 * parent; the previous one may be outdated by a page split.
+ 			 */
+ 			gistDecreasePathRefcount(nodeBuffer->path->parent);
+ 			nodeBuffer->path->parent = parent;
+ 			parent->refCount++;
+ 		}
+ 	}
+ 
+ 	return nodeBuffer;
+ }
+ 
+ /*
+  * Allocate memory for a buffer page.
+  */
+ static GISTNodeBufferPage *
+ gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
+ {
+ 	GISTNodeBufferPage *pageBuffer;
+ 
+ 	pageBuffer = (GISTNodeBufferPage *) MemoryContextAlloc(gfbb->context,
+ 														   BLCKSZ);
+ 	pageBuffer->prev = InvalidBlockNumber;
+ 
+ 	/* Set page free space */
+ 	PAGE_FREE_SPACE(pageBuffer) = BLCKSZ - BUFFER_PAGE_DATA_OFFSET;
+ 	return pageBuffer;
+ }
+ 
+ /*
+  * Add specified block number into loadedBuffers array.
+  */
+ static void
+ gistAddLoadedBuffer(GISTBuildBuffers *gfbb, BlockNumber blocknum)
+ {
+ 	/* Enlarge the array if needed */
+ 	if (gfbb->loadedBuffersCount >= gfbb->loadedBuffersLen)
+ 	{
+ 		gfbb->loadedBuffersLen *= 2;
+ 		gfbb->loadedBuffers = (BlockNumber *) repalloc(gfbb->loadedBuffers,
+ 													 gfbb->loadedBuffersLen *
+ 													   sizeof(BlockNumber));
+ 	}
+ 
+ 	gfbb->loadedBuffers[gfbb->loadedBuffersCount] = blocknum;
+ 	gfbb->loadedBuffersCount++;
+ }
+ 
+ 
+ /*
+  * Load last page of node buffer into main memory.
+  */
+ static void
+ gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+ {
+ 	/* Check if we really should load something */
+ 	if (!nodeBuffer->pageBuffer && nodeBuffer->blocksCount > 0)
+ 	{
+ 		/* Allocate memory for page */
+ 		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+ 
+ 		/* Read block from temporary file */
+ 		BufFileSeekBlock(gfbb->pfile, nodeBuffer->pageBlocknum);
+ 		BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+ 
+ 		/* Mark file block as free */
+ 		gistBuffersReleaseBlock(gfbb, nodeBuffer->pageBlocknum);
+ 
+ 		/* Mark node buffer as loaded */
+ 		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+ 		nodeBuffer->pageBlocknum = InvalidBlockNumber;
+ 	}
+ }
+ 
+ /*
+  * Write last page of node buffer to the disk.
+  */
+ static void
+ gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+ {
+ 	/* Check if we have something to write */
+ 	if (nodeBuffer->pageBuffer)
+ 	{
+ 		BlockNumber blkno;
+ 
+ 		/* Get free file block */
+ 		blkno = gistBuffersGetFreeBlock(gfbb);
+ 
+ 		/* Write block to the temporary file */
+ 		BufFileSeekBlock(gfbb->pfile, blkno);
+ 		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+ 
+ 		/* Free memory of that page */
+ 		pfree(nodeBuffer->pageBuffer);
+ 		nodeBuffer->pageBuffer = NULL;
+ 
+ 		/* Save block number */
+ 		nodeBuffer->pageBlocknum = blkno;
+ 	}
+ }
+ 
+ /*
+  * Write last pages of all node buffers to the disk.
+  */
+ void
+ gistUnloadNodeBuffers(GISTBuildBuffers *gfbb)
+ {
+ 	int			i;
+ 
+ 	/* Iterate over node buffers whose last page is loaded into main memory */
+ 	for (i = 0; i < gfbb->loadedBuffersCount; i++)
+ 	{
+ 		GISTNodeBuffer *nodeBuffer;
+ 		bool		found;
+ 
+ 		/* Find node buffer by its block number */
+ 		nodeBuffer = hash_search(gfbb->nodeBuffersTab, &gfbb->loadedBuffers[i],
+ 								 HASH_FIND, &found);
+ 
+ 		/*
+ 		 * The node buffer might not be found, because it can disappear
+ 		 * during a page split. That's OK, just skip it.
+ 		 */
+ 		if (!found)
+ 			continue;
+ 
+ 		/* Unload last page to the disk */
+ 		gistUnloadNodeBuffer(gfbb, nodeBuffer);
+ 	}
+ 	/* Now there are no node buffers with loaded last page */
+ 	gfbb->loadedBuffersCount = 0;
+ }
+ 
+ /*
+  * Add index tuple to buffer page.
+  */
+ static void
+ gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple itup)
+ {
+ 	/*
+ 	 * Get pointer to the start of free space on the page
+ 	 */
+ 	char	   *ptr = (char *) pageBuffer + BUFFER_PAGE_DATA_OFFSET
+ 	+ PAGE_FREE_SPACE(pageBuffer) - MAXALIGN(IndexTupleSize(itup));
+ 
+ 	/*
+ 	 * There should be enough space
+ 	 */
+ 	Assert(PAGE_FREE_SPACE(pageBuffer) >= MAXALIGN(IndexTupleSize(itup)));
+ 
+ 	/*
+ 	 * Reduce free space value of page
+ 	 */
+ 	PAGE_FREE_SPACE(pageBuffer) -= MAXALIGN(IndexTupleSize(itup));
+ 
+ 	/*
+ 	 * Copy index tuple to free space
+ 	 */
+ 	memcpy(ptr, itup, IndexTupleSize(itup));
+ }
+ 
+ /*
+  * Get last item from buffer page and remove it from page.
+  */
+ static void
+ gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *itup)
+ {
+ 	/*
+ 	 * Get pointer to last index tuple
+ 	 */
+ 	IndexTuple	ptr = (IndexTuple) ((char *) pageBuffer
+ 									+ BUFFER_PAGE_DATA_OFFSET
+ 									+ PAGE_FREE_SPACE(pageBuffer));
+ 
+ 	/*
+ 	 * Page shouldn't be empty
+ 	 */
+ 	Assert(!PAGE_IS_EMPTY(pageBuffer));
+ 
+ 	/*
+ 	 * Allocate memory for returned index tuple copy
+ 	 */
+ 	*itup = (IndexTuple) palloc(IndexTupleSize(ptr));
+ 
+ 	/*
+ 	 * Copy data
+ 	 */
+ 	memcpy(*itup, ptr, IndexTupleSize(ptr));
+ 
+ 	/*
+ 	 * Increase free space value of page
+ 	 */
+ 	PAGE_FREE_SPACE(pageBuffer) += MAXALIGN(IndexTupleSize(*itup));
+ }
+ 
+ /*
+  * Push an index tuple to node buffer.
+  */
+ void
+ gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+ 						 IndexTuple itup)
+ {
+ 	/*
+ 	 * Most of the memory operations below happen in the persistent
+ 	 * buffering build context, so switch to it.
+ 	 */
+ 	MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+ 
+ 	/* Is the buffer currently empty? */
+ 	if (nodeBuffer->blocksCount == 0)
+ 	{
+ 		/* It's empty, let's create the first page */
+ 		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+ 		nodeBuffer->blocksCount = 1;
+ 		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+ 	}
+ 
+ 	/* Load last page of node buffer if it wasn't already */
+ 	if (!nodeBuffer->pageBuffer)
+ 		gistLoadNodeBuffer(gfbb, nodeBuffer);
+ 
+ 	/*
+ 	 * Check if there is enough space on the last page for the tuple
+ 	 */
+ 	if (PAGE_NO_SPACE(nodeBuffer->pageBuffer, itup))
+ 	{
+ 		/*
+ 		 * Nope. Swap previous block to disk and allocate a new one.
+ 		 */
+ 		BlockNumber blkno;
+ 
+ 		/* Write filled page to the disk */
+ 		blkno = gistBuffersGetFreeBlock(gfbb);
+ 		BufFileSeekBlock(gfbb->pfile, blkno);
+ 		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+ 
+ 		/* Mark space of in-memory page as empty */
+ 		PAGE_FREE_SPACE(nodeBuffer->pageBuffer) =
+ 			BLCKSZ - MAXALIGN(offsetof(GISTNodeBufferPage, tupledata));
+ 
+ 		/* Save block number of the previous page */
+ 		nodeBuffer->pageBuffer->prev = blkno;
+ 
+ 		/* We've just added one more page */
+ 		nodeBuffer->blocksCount++;
+ 	}
+ 
+ 	gistPlaceItupToPage(nodeBuffer->pageBuffer, itup);
+ 
+ 	/*
+ 	 * If the buffer is now more than half full, add it to the emptying queue.
+ 	 */
+ 	if (BUFFER_HALF_FILLED(nodeBuffer, gfbb) && !nodeBuffer->queuedForEmptying)
+ 	{
+ 		MemoryContextSwitchTo(gfbb->context);
+ 		gfbb->bufferEmptyingQueue =	lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
+ 		nodeBuffer->queuedForEmptying = true;
+ 	}
+ 
+ 	/* Restore memory context */
+ 	MemoryContextSwitchTo(oldcxt);
+ }
+ 
+ /*
+  * Removes one index tuple from node buffer. Returns true if success and false
+  * if node buffer is empty.
+  */
+ bool
+ gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+ 						  IndexTuple *itup)
+ {
+ 	/*
+ 	 * If node buffer is empty then return false.
+ 	 */
+ 	if (nodeBuffer->blocksCount <= 0)
+ 		return false;
+ 
+ 	/* Load last page of node buffer if needed */
+ 	if (!nodeBuffer->pageBuffer)
+ 		gistLoadNodeBuffer(gfbb, nodeBuffer);
+ 
+ 	/*
+ 	 * Get index tuple from last non-empty page.
+ 	 */
+ 	gistGetItupFromPage(nodeBuffer->pageBuffer, itup);
+ 
+ 	/*
+ 	 * Check whether the page the index tuple was taken from is now empty
+ 	 */
+ 	if (PAGE_IS_EMPTY(nodeBuffer->pageBuffer))
+ 	{
+ 		BlockNumber prevblkno;
+ 
+ 		/*
+ 		 * If it's empty then we need to release buffer file block and free
+ 		 * page buffer.
+ 		 */
+ 		nodeBuffer->blocksCount--;
+ 
+ 		/*
+ 		 * If there are more pages, fetch the previous one
+ 		 */
+ 		prevblkno = nodeBuffer->pageBuffer->prev;
+ 		if (prevblkno != InvalidBlockNumber)
+ 		{
+ 			/* There actually is previous page, so read it. */
+ 			Assert(nodeBuffer->blocksCount > 0);
+ 			BufFileSeekBlock(gfbb->pfile, prevblkno);
+ 			BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+ 
+ 			/* Mark block as free */
+ 			gistBuffersReleaseBlock(gfbb, prevblkno);
+ 		}
+ 		else
+ 		{
+ 			/* Actually there are no more pages. Free memory. */
+ 			Assert(nodeBuffer->blocksCount == 0);
+ 			pfree(nodeBuffer->pageBuffer);
+ 			nodeBuffer->pageBuffer = NULL;
+ 		}
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * qsort comparator for sorting freeBlocks[] into decreasing order.
+  */
+ static int
+ gistBuffersFreeBlocksCmp(const void *a, const void *b)
+ {
+ 	long		ablk = *((const long *) a);
+ 	long		bblk = *((const long *) b);
+ 
+ 	/*
+ 	 * can't just subtract because long might be wider than int
+ 	 */
+ 	if (ablk < bblk)
+ 		return 1;
+ 	if (ablk > bblk)
+ 		return -1;
+ 	return 0;
+ }
+ 
+ /*
+  * Select a currently unused block for writing to.
+  *
+  * NB: should only be called when writer is ready to write immediately,
+  * to ensure that first write pass is sequential.
+  */
+ static long
+ gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb)
+ {
+ 	/*
+ 	 * If there are multiple free blocks, we select the one appearing last in
+ 	 * freeBlocks[] (after sorting the array if needed).  If there are none,
+ 	 * assign the next block at the end of the file.
+ 	 */
+ 	if (gfbb->nFreeBlocks > 0)
+ 	{
+ 		if (!gfbb->blocksSorted)
+ 		{
+ 			qsort((void *) gfbb->freeBlocks, gfbb->nFreeBlocks,
+ 				  sizeof(long), gistBuffersFreeBlocksCmp);
+ 			gfbb->blocksSorted = true;
+ 		}
+ 		return gfbb->freeBlocks[--gfbb->nFreeBlocks];
+ 	}
+ 	else
+ 		return gfbb->nFileBlocks++;
+ }
+ 
+ /*
+  * Return a block# to the freelist.
+  */
+ static void
+ gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum)
+ {
+ 	int			ndx;
+ 
+ 	/*
+ 	 * Enlarge freeBlocks array if full.
+ 	 */
+ 	if (gfbb->nFreeBlocks >= gfbb->freeBlocksLen)
+ 	{
+ 		gfbb->freeBlocksLen *= 2;
+ 		gfbb->freeBlocks = (long *) repalloc(gfbb->freeBlocks,
+ 											 gfbb->freeBlocksLen *
+ 											 sizeof(long));
+ 	}
+ 
+ 	/*
+ 	 * Add blocknum to array, and mark the array unsorted if it's no longer in
+ 	 * decreasing order.
+ 	 */
+ 	ndx = gfbb->nFreeBlocks++;
+ 	gfbb->freeBlocks[ndx] = blocknum;
+ 	if (ndx > 0 && gfbb->freeBlocks[ndx - 1] < blocknum)
+ 		gfbb->blocksSorted = false;
+ }
+ 
+ /*
+  * Free buffering build data structure.
+  */
+ void
+ gistFreeBuildBuffers(GISTBuildBuffers *gfbb)
+ {
+ 	/* Close buffers file. */
+ 	BufFileClose(gfbb->pfile);
+ 
+ 	/* All other things will be freed on memory context release */
+ }
+ 
+ /*
+  * Information about a node buffer that receives index tuples relocated
+  * from the buffer of a split node.
+  */
+ typedef struct
+ {
+ 	GISTENTRY	entry[INDEX_MAX_KEYS];
+ 	bool		isnull[INDEX_MAX_KEYS];
+ 	GISTPageSplitInfo *splitinfo;
+ 	GISTNodeBuffer *nodeBuffer;
+ } RelocationBufferInfo;
+ 
+ /*
+  * Maintain data structures on page split.
+  */
+ void
+ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+ 								Relation r, GISTBufferingInsertStack *path,
+ 								Buffer buffer, List *splitinfo)
+ {
+ 	RelocationBufferInfo *relocationBuffersInfos;
+ 	bool		found;
+ 	GISTNodeBuffer *nodeBuffer;
+ 	BlockNumber blocknum;
+ 	IndexTuple	itup;
+ 	int			splitPagesCount = 0,
+ 				i;
+ 	GISTENTRY	entry[INDEX_MAX_KEYS];
+ 	bool		isnull[INDEX_MAX_KEYS];
+ 	GISTNodeBuffer nodebuf;
+ 	ListCell   *lc;
+ 
+ 	/*
+ 	 * If the split page's level doesn't have buffers, we have nothing to do.
+ 	 */
+ 	if (!LEVEL_HAS_BUFFERS(path->level, gfbb))
+ 		return;
+ 
+ 	/*
+ 	 * Get pointer to the node buffer of the split page.
+ 	 */
+ 	blocknum = BufferGetBlockNumber(buffer);
+ 	nodeBuffer = hash_search(gfbb->nodeBuffersTab, &blocknum,
+ 							 HASH_FIND, &found);
+ 	if (!found)
+ 	{
+ 		/*
+ 		 * Node buffer should exist at this point. If it didn't exist before,
+ 		 * the insertion that caused the page to split should've created it.
+ 		 */
+ 		elog(ERROR, "node buffer of page being split (%u) does not exist",
+ 			 blocknum);
+ 	}
+ 
+ 	/*
+ 	 * Make a copy of the old buffer, as we're going to reuse the old one as
+ 	 * the buffer for the new left page, which is on the same block as the
+ 	 * old page. That's not true for the root page, but that's fine because
+ 	 * we never have a buffer on the root page anyway. The original algorithm
+ 	 * as described by Arge et al did, but it's of no use, as you might as
+ 	 * well read the tuples straight from the heap instead of the root buffer.
+ 	 */
+ 	Assert(blocknum != GIST_ROOT_BLKNO);
+ 	memcpy(&nodebuf, nodeBuffer, sizeof(GISTNodeBuffer));
+ 
+ 	/* Reset the old buffer, used for the new left page from now on */
+ 	nodeBuffer->blocksCount = 0;
+ 	nodeBuffer->pageBuffer = NULL;
+ 	nodeBuffer->pageBlocknum = InvalidBlockNumber;
+ 
+ 	/* Reassign pointer to the saved copy. */
+ 	nodeBuffer = &nodebuf;
+ 
+ 	/*
+ 	 * Allocate memory for information about relocation buffers.
+ 	 */
+ 	splitPagesCount = list_length(splitinfo);
+ 	relocationBuffersInfos =
+ 		(RelocationBufferInfo *) palloc(sizeof(RelocationBufferInfo) *
+ 										splitPagesCount);
+ 
+ 	/*
+ 	 * Fill relocation buffers information for node buffers of pages produced
+ 	 * by split.
+ 	 */
+ 	i = 0;
+ 	foreach(lc, splitinfo)
+ 	{
+ 		GISTPageSplitInfo *si = (GISTPageSplitInfo *) lfirst(lc);
+ 		GISTNodeBuffer *newNodeBuffer;
+ 
+ 		/* Decompress parent index tuple of node buffer page. */
+ 		gistDeCompressAtt(giststate, r,
+ 						  si->downlink, NULL, (OffsetNumber) 0,
+ 						  relocationBuffersInfos[i].entry,
+ 						  relocationBuffersInfos[i].isnull);
+ 
+ 		newNodeBuffer = gistGetNodeBuffer(gfbb, giststate, BufferGetBlockNumber(si->buf),
+ 								   path->downlinkoffnum, path->parent);
+ 
+ 		relocationBuffersInfos[i].nodeBuffer = newNodeBuffer;
+ 		relocationBuffersInfos[i].splitinfo = si;
+ 
+ 		i++;
+ 	}
+ 
+ 	/*
+ 	 * Loop through all index tuples in the buffer of the split page,
+ 	 * moving all the tuples to the buffers on the new pages.
+ 	 */
+ 	while (gistPopItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
+ 	{
+ 		float		sum_grow,
+ 					which_grow[INDEX_MAX_KEYS];
+ 		int			i,
+ 					which;
+ 		IndexTuple	newtup;
+ 
+ 		/*
+ 		 * Choose which page this tuple should go to.
+ 		 */
+ 		gistDeCompressAtt(giststate, r,
+ 						  itup, NULL, (OffsetNumber) 0, entry, isnull);
+ 
+ 		which = -1;
+ 		*which_grow = -1.0f;
+ 		sum_grow = 1.0f;
+ 
+ 		for (i = 0; i < splitPagesCount && sum_grow; i++)
+ 		{
+ 			int			j;
+ 			RelocationBufferInfo *splitPageInfo = &relocationBuffersInfos[i];
+ 
+ 			sum_grow = 0.0f;
+ 			for (j = 0; j < r->rd_att->natts; j++)
+ 			{
+ 				float		usize;
+ 
+ 				usize = gistpenalty(giststate, j,
+ 									&splitPageInfo->entry[j],
+ 									splitPageInfo->isnull[j],
+ 									&entry[j], isnull[j]);
+ 
+ 				if (which_grow[j] < 0 || usize < which_grow[j])
+ 				{
+ 					which = i;
+ 					which_grow[j] = usize;
+ 					if (j < r->rd_att->natts - 1 && i == 0)
+ 						which_grow[j + 1] = -1;
+ 					sum_grow += which_grow[j];
+ 				}
+ 				else if (which_grow[j] == usize)
+ 					sum_grow += usize;
+ 				else
+ 				{
+ 					sum_grow = 1;
+ 					break;
+ 				}
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * push item to selected node buffer
+ 		 */
+ 		gistPushItupToNodeBuffer(gfbb, relocationBuffersInfos[which].nodeBuffer,
+ 								 itup);
+ 
+ 		/*
+ 		 * Adjust the downlink for this page, if needed.
+ 		 */
+ 		newtup = gistgetadjusted(r, relocationBuffersInfos[which].splitinfo->downlink,
+ 								 itup, giststate);
+ 		if (newtup)
+ 		{
+ 			gistDeCompressAtt(giststate, r,
+ 							  newtup, NULL, (OffsetNumber) 0,
+ 							  relocationBuffersInfos[which].entry,
+ 							  relocationBuffersInfos[which].isnull);
+ 
+ 			relocationBuffersInfos[which].splitinfo->downlink = newtup;
+ 		}
+ 	}
+ 
+ 	/* Note that the buffer currently being emptied has been split */
+ 	if (blocknum == gfbb->currentEmptyingBufferBlockNumber)
+ 		gfbb->currentEmptyingBufferSplit = true;
+ 
+ 	pfree(relocationBuffersInfos);
+ }
*** a/src/backend/access/gist/gistutil.c
--- b/src/backend/access/gist/gistutil.c
***************
*** 670,682 **** gistoptions(PG_FUNCTION_ARGS)
  {
  	Datum		reloptions = PG_GETARG_DATUM(0);
  	bool		validate = PG_GETARG_BOOL(1);
! 	bytea	   *result;
  
! 	result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
  
- 	if (result)
- 		PG_RETURN_BYTEA_P(result);
- 	PG_RETURN_NULL();
  }
  
  /*
--- 670,699 ----
  {
  	Datum		reloptions = PG_GETARG_DATUM(0);
  	bool		validate = PG_GETARG_BOOL(1);
! 	relopt_value *options;
! 	GiSTOptions *rdopts;
! 	int			numoptions;
! 	static const relopt_parse_elt tab[] = {
! 		{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
! 		{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
! 	};
  
! 	options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
! 							  &numoptions);
! 
! 	/* if none set, we're done */
! 	if (numoptions == 0)
! 		PG_RETURN_NULL();
! 
! 	rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
! 
! 	fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
! 				   validate, tab, lengthof(tab));
! 
! 	pfree(options);
! 
! 	PG_RETURN_BYTEA_P(rdopts);
  
  }
  
  /*
*** a/src/backend/access/gist/gistxlog.c
--- b/src/backend/access/gist/gistxlog.c
***************
*** 266,272 **** gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
  			else
  				GistPageGetOpaque(page)->rightlink = xldata->origrlink;
  			GistPageGetOpaque(page)->nsn = xldata->orignsn;
! 			if (i < xlrec.data->npage - 1 && !isrootsplit)
  				GistMarkFollowRight(page);
  			else
  				GistClearFollowRight(page);
--- 266,273 ----
  			else
  				GistPageGetOpaque(page)->rightlink = xldata->origrlink;
  			GistPageGetOpaque(page)->nsn = xldata->orignsn;
! 			if (i < xlrec.data->npage - 1 && !isrootsplit &&
! 				!xldata->noFollowRight)
  				GistMarkFollowRight(page);
  			else
  				GistClearFollowRight(page);
***************
*** 414,420 **** XLogRecPtr
  gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
  			  SplitedPageLayout *dist,
  			  BlockNumber origrlink, GistNSN orignsn,
! 			  Buffer leftchildbuf)
  {
  	XLogRecData *rdata;
  	gistxlogPageSplit xlrec;
--- 415,421 ----
  gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
  			  SplitedPageLayout *dist,
  			  BlockNumber origrlink, GistNSN orignsn,
! 			  Buffer leftchildbuf, bool noFollowRight)
  {
  	XLogRecData *rdata;
  	gistxlogPageSplit xlrec;
***************
*** 436,441 **** gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
--- 437,443 ----
  	xlrec.npage = (uint16) npage;
  	xlrec.leftchild =
  		BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
+ 	xlrec.noFollowRight = noFollowRight;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = sizeof(gistxlogPageSplit);
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 17,29 ****
--- 17,72 ----
  #include "access/gist.h"
  #include "access/itup.h"
  #include "storage/bufmgr.h"
+ #include "storage/buffile.h"
  #include "utils/rbtree.h"
+ #include "utils/hsearch.h"
+ 
+ /* Does the specified level have buffers? */
+ #define LEVEL_HAS_BUFFERS(nlevel, gfbb) ((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && (nlevel) != (gfbb)->rootitem->level)
+ /* Is the specified buffer more than half full (should it be queued for emptying)? */
+ #define BUFFER_HALF_FILLED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
+ /* Has the specified buffer overflowed (can't it take any more index tuples)? */
+ #define BUFFER_OVERFLOWED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
  
  /* Buffer lock modes */
  #define GIST_SHARE	BUFFER_LOCK_SHARE
  #define GIST_EXCLUSIVE	BUFFER_LOCK_EXCLUSIVE
  #define GIST_UNLOCK BUFFER_LOCK_UNLOCK
  
+ typedef struct
+ {
+ 	BlockNumber prev;
+ 	uint32		freespace;
+ 	char		tupledata[1];
+ } GISTNodeBufferPage;
+ 
+ #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
+ /* Returns free space in node buffer page */
+ #define PAGE_FREE_SPACE(nbp) ((nbp)->freespace)
+ /* Checks if node buffer page is empty */
+ #define PAGE_IS_EMPTY(nbp) ((nbp)->freespace == BLCKSZ - BUFFER_PAGE_DATA_OFFSET)
+ /* Checks whether a node buffer page lacks sufficient space for an index tuple */
+ #define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
+ 										MAXALIGN(IndexTupleSize(itup)))
+ 
+ /* Data structure for the buffer of a tree node */
+ typedef struct
+ {
+ 	/* number of page containing node */
+ 	BlockNumber nodeBlocknum;
+ 
+ 	/* count of blocks occupied by buffer */
+ 	int32		blocksCount;
+ 
+ 	BlockNumber pageBlocknum;
+ 	GISTNodeBufferPage *pageBuffer;
+ 
+ 	/* is this buffer queued for emptying? */
+ 	bool		queuedForEmptying;
+ 
+ 	struct GISTBufferingInsertStack *path;
+ } GISTNodeBuffer;
+ 
  /*
   * GISTSTATE: information needed for any GiST index operation
   *
***************
*** 44,49 **** typedef struct GISTSTATE
--- 87,94 ----
  	/* Collations to pass to the support functions */
  	Oid			supportCollation[INDEX_MAX_KEYS];
  
+ 	struct GISTBuildBuffers *gfbb;
+ 
  	TupleDesc	tupdesc;
  } GISTSTATE;
  
***************
*** 170,175 **** typedef struct gistxlogPageSplit
--- 215,221 ----
  
  	BlockNumber leftchild;		/* like in gistxlogPageUpdate */
  	uint16		npage;			/* # of pages in the split */
+ 	bool		noFollowRight;	/* skip followRight flag setting */
  
  	/*
  	 * follow: 1. gistxlogPage and array of IndexTupleData per page
***************
*** 225,230 **** typedef struct GISTInsertStack
--- 271,346 ----
  	struct GISTInsertStack *parent;
  } GISTInsertStack;
  
+ /*
+  * Extended GISTInsertStack for buffering GiST index build. It additionally
+  * holds the level number of the page.
+  */
+ typedef struct GISTBufferingInsertStack
+ {
+ 	/* current page */
+ 	BlockNumber blkno;
+ 
+ 	/* offset of the downlink in the parent page, that points to this page */
+ 	OffsetNumber downlinkoffnum;
+ 
+ 	/* pointer to parent */
+ 	struct GISTBufferingInsertStack *parent;
+ 
+ 	int			refCount;
+ 
+ 	/* level number */
+ 	int			level;
+ }	GISTBufferingInsertStack;
+ 
+ /*
+  * Data structure with general information about build buffers.
+  */
+ typedef struct GISTBuildBuffers
+ {
+ 	/* memory context which is persistent during buffering build */
+ 	MemoryContext context;
+ 	/* underlying files */
+ 	BufFile    *pfile;
+ 	/* # of blocks used in underlying files */
+ 	long		nFileBlocks;
+ 	/* is freeBlocks[] currently in order? */
+ 	bool		blocksSorted;
+ 	/* resizable array of free blocks */
+ 	long	   *freeBlocks;
+ 	/* # of currently free blocks */
+ 	int			nFreeBlocks;
+ 	/* current allocated length of freeBlocks[] */
+ 	int			freeBlocksLen;
+ 
+ 	/* hash for buffers by block number */
+ 	HTAB	   *nodeBuffersTab;
+ 
+ 	/* stack of buffers for emptying */
+ 	List	   *bufferEmptyingQueue;
+ 	/* block number of the buffer currently being emptied */
+ 	BlockNumber currentEmptyingBufferBlockNumber;
+ 	/* whether the buffer being emptied was split - a signal to stop emptying */
+ 	bool		currentEmptyingBufferSplit;
+ 
+ 	/* step of levels for buffers location */
+ 	int			levelStep;
+ 	/* maximal number of pages occupied by buffer */
+ 	int			pagesPerBuffer;
+ 
+ 	/* array of lists of non-empty buffers on levels for final emptying */
+ 	List	  **buffersOnLevels;
+ 	int			buffersOnLevelsLen;
+ 
+ 	/*
+ 	 * Dynamically-sized array of block numbers of buffers loaded into main
+ 	 * memory.
+ 	 */
+ 	BlockNumber *loadedBuffers;
+ 	int			loadedBuffersCount;		/* entries currently in loadedBuffers */
+ 	int			loadedBuffersLen;		/* allocated size of loadedBuffers */
+ 	GISTBufferingInsertStack *rootitem;
+ }	GISTBuildBuffers;
+ 
  typedef struct GistSplitVector
  {
  	GIST_SPLITVEC splitVector;	/* to/from PickSplit method */
***************
*** 286,291 **** extern Datum gistinsert(PG_FUNCTION_ARGS);
--- 402,424 ----
  extern MemoryContext createTempGistContext(void);
  extern void initGISTstate(GISTSTATE *giststate, Relation index);
  extern void freeGISTstate(GISTSTATE *giststate);
+ extern void gistdoinsert(Relation r,
+ 			 IndexTuple itup,
+ 			 Size freespace,
+ 			 GISTSTATE *GISTstate);
+ 
+ /* A List of these is returned from gistplacetopage() in *splitinfo */
+ typedef struct
+ {
+ 	Buffer		buf;			/* the split page "half" */
+ 	IndexTuple	downlink;		/* downlink for this half. */
+ } GISTPageSplitInfo;
+ 
+ extern bool gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
+ 				Buffer buffer,
+ 				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+ 				Buffer leftchildbuf,
+ 				List **splitinfo);
  
  extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
  		  int len, GISTSTATE *giststate);
***************
*** 305,311 **** extern XLogRecPtr gistXLogSplit(RelFileNode node,
  			  BlockNumber blkno, bool page_is_leaf,
  			  SplitedPageLayout *dist,
  			  BlockNumber origrlink, GistNSN oldnsn,
! 			  Buffer leftchild);
  
  /* gistget.c */
  extern Datum gistgettuple(PG_FUNCTION_ARGS);
--- 438,444 ----
  			  BlockNumber blkno, bool page_is_leaf,
  			  SplitedPageLayout *dist,
  			  BlockNumber origrlink, GistNSN oldnsn,
! 			  Buffer leftchild, bool noFollowRight);
  
  /* gistget.c */
  extern Datum gistgettuple(PG_FUNCTION_ARGS);
***************
*** 313,318 **** extern Datum gistgetbitmap(PG_FUNCTION_ARGS);
--- 446,461 ----
  
  /* gistutil.c */
  
+ /*
+  * Storage type for GiST's reloptions
+  */
+ typedef struct GiSTOptions
+ {
+ 	int32		vl_len_;		/* varlena header (do not touch directly!) */
+ 	int			fillfactor;		/* page fill factor in percent (0..100) */
+ 	int			bufferingModeOffset;	/* offset of "buffering" option string */
+ }	GiSTOptions;
+ 
  #define GiSTPageSize   \
  	( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) )
  
***************
*** 380,383 **** extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
--- 523,546 ----
  			   GistSplitVector *v, GistEntryVector *entryvec,
  			   int attno);
  
+ /* gistbuild.c */
+ extern void gistDecreasePathRefcount(GISTBufferingInsertStack *path);
+ extern void gistValidateBufferingOption(char *value);
+ 
+ /* gistbuildbuffers.c */
+ extern void gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel);
+ extern GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb,
+ 				  GISTSTATE *giststate, BlockNumber blkno,
+ 				  OffsetNumber downlinkoffnum, GISTBufferingInsertStack *parent);
+ extern void gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb,
+ 						 GISTNodeBuffer *nodeBuffer, IndexTuple item);
+ extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
+ 						  GISTNodeBuffer *nodeBuffer, IndexTuple *item);
+ extern void gistFreeBuildBuffers(GISTBuildBuffers *gfbb);
+ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
+ 								GISTSTATE *giststate, Relation r,
+ 							  GISTBufferingInsertStack *path, Buffer buffer,
+ 								List *splitinfo);
+ extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
+ 
  #endif   /* GIST_PRIVATE_H */
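
As a reading aid for the buffer page routines in the patch above (gistPlaceItupToPage and gistGetItupFromPage), here is a minimal standalone sketch of the same last-in-first-out page layout. It is not part of the patch. SketchPage, SKETCH_PAGE_SIZE, SKETCH_ALIGN and the sketch_page_* functions are hypothetical names; the 2-byte length prefix stands in for the size kept in a real IndexTuple header, and 8-byte alignment is assumed in place of MAXALIGN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192	/* assumption: stands in for BLCKSZ */
#define SKETCH_ALIGN(n) (((size_t) (n) + 7) & ~(size_t) 7)	/* stands in for MAXALIGN */

/* Hypothetical page: small header, then a data area filled from the end. */
typedef struct
{
	uint32_t	prev;		/* previous page of the same buffer, unused here */
	uint32_t	freespace;	/* free bytes at the start of data[] */
	char		data[SKETCH_PAGE_SIZE - 2 * sizeof(uint32_t)];
} SketchPage;

void
sketch_page_init(SketchPage *page)
{
	page->prev = UINT32_MAX;
	page->freespace = (uint32_t) sizeof(page->data);
}

int
sketch_page_is_empty(const SketchPage *page)
{
	return page->freespace == sizeof(page->data);
}

/* Place an item just below the previously placed one; returns 0 if full. */
int
sketch_page_push(SketchPage *page, const void *payload, uint16_t len)
{
	size_t		need = SKETCH_ALIGN(sizeof(uint16_t) + len);
	char	   *ptr;

	if (page->freespace < need)
		return 0;
	page->freespace -= (uint32_t) need;
	ptr = page->data + page->freespace;
	memcpy(ptr, &len, sizeof(len));				/* length prefix */
	memcpy(ptr + sizeof(len), payload, len);	/* item body */
	return 1;
}

/* Copy out the most recently placed item and give its space back. */
uint16_t
sketch_page_pop(SketchPage *page, void *out)
{
	uint16_t	len;
	char	   *ptr = page->data + page->freespace;

	assert(!sketch_page_is_empty(page));
	memcpy(&len, ptr, sizeof(len));
	memcpy(out, ptr + sizeof(len), len);
	page->freespace += (uint32_t) SKETCH_ALIGN(sizeof(uint16_t) + len);
	return len;
}
```

Because only the free-space counter moves, a page acts as a stack: the item pushed last is popped first, which matches how the patch reads tuples back from the last page of a node buffer.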
#116Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#115)
4 attachment(s)
Re: WIP: Fast GiST index build

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a test
server.

1) I've updated links to the datasets on the wiki page.
2) The script for index quality testing, fastbuild_test.php, is attached.
To run it you need PHP with the pdo and pdo_pgsql modules. The plantuner
module is also required (it is used to force the planner to use a specific
index). After running the script, the following query returns a relative
score of index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index
whose quality is taken to be 1.
3) A patch that makes plantuner work with HEAD is also attached.
4) A patch with my split algorithm implementation is attached. In its current
form it is suitable only for testing purposes.
5) For index creation I use a simple script, attached as 'indexes.sql'. I run
a similar script, with different index names, with my split patch applied.
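
The relative score returned by the query in item 2 can be sketched outside SQL as follows. This is a minimal illustration, not code from the test script: relative_index_score is a hypothetical name, the arrays are assumed to hold the per-predicate "count" values from test_result for the index under test and for the reference index, and double is used where the query casts to real.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Average, over n predicates, of the ratio between the count measured for
 * the index under test and the count measured for the reference index.
 * The reference index compared with itself scores exactly 1.
 */
double
relative_index_score(const double *counts, const double *baseline, size_t n)
{
	double		sum = 0.0;
	size_t		i;

	assert(n > 0);
	for (i = 0; i < n; i++)
	{
		assert(baseline[i] > 0.0);	/* avoid division by zero */
		sum += counts[i] / baseline[i];
	}
	return sum / (double) n;
}
```

Note that this averages the per-predicate ratios, as the query does, rather than dividing one total by another, so every predicate carries the same weight regardless of how large its counts are.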

Feel free to ask questions about all this stuff.

------
With best regards,
Alexander Korotkov.

Attachments:

fastbuild_test.php.gzapplication/x-gzip; name=fastbuild_test.php.gzDownload
plantuner.patch.gzapplication/x-gzip; name=plantuner.patch.gzDownload
my_split.patch.gzapplication/x-gzip; name=my_split.patch.gzDownload
indexes.sqltext/x-sql; charset=US-ASCII; name=indexes.sqlDownload
#117Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#114)
Re: WIP: Fast GiST index build

On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

In the tests on the first version of the patch I found the index quality of
the regular build much better than that of the buffering build (without
neighbor relocation). Now it's similar, though only because the index quality
of the regular build became worse. Thereby, in the current tests the regular
index build is faster than in the previous ones. I see the following possible
causes:
1) I didn't save the source random data, so now it's new random data.
2) Some environment parameters of my test setup may have changed, though I
doubt it.
Despite these possible explanations, it seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't
expect a big difference in the index quality of one random data set over
another. If the index quality depends so much on, say, the distribution of
the few first tuples that are inserted to it, that's a quite interesting
find on its own, and merits some further research.

Yeah, it's pretty strange. Using the same random datasets in different tests
can help to exclude one possible cause of the difference.

In order to compare index build methods on higher-quality indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36).
On the uniform dataset, search is about 10 times faster! And, as expected,
the regular index build becomes much slower. It has been running for more
than 60 hours and only 50% of the index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests, though it's hard for me to explain that.

Hmm, makes me a bit uneasy that we're testing with a modified page
splitting algorithm. But if the new algorithm is that good, could you post
that as a separate patch, please?

I've posted it in another message and I will try to get it into a more
appropriate form. Let me clarify this a little. I don't think my split
algorithm is 10 times better than state-of-the-art algorithms. I think that
the currently used new linear split shows unreasonably bad results in many
cases. For example, uniformly distributed data is a pretty easy case: with
almost any splitting algorithm we can get an index with almost zero overlap.
But new linear split produces huge overlaps in this case. That's why I
decided to experiment with another split algorithm.
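
To make the notion of overlap concrete, here is a minimal sketch, assuming two-dimensional axis-aligned bounding boxes, of the overlap area between the bounding boxes of two pages produced by a split. Box and box_overlap_area are illustrative names only and are not taken from GiST or from the split patch. A split with almost zero overlap gives values near 0.

```c
#include <assert.h>

/* Axis-aligned bounding box (hypothetical type, for illustration only). */
typedef struct
{
	double		xlo, ylo;		/* lower-left corner */
	double		xhi, yhi;		/* upper-right corner */
} Box;

static double
min_d(double a, double b)
{
	return a < b ? a : b;
}

static double
max_d(double a, double b)
{
	return a > b ? a : b;
}

/* Area shared by two boxes; 0 if they are disjoint or merely touch. */
double
box_overlap_area(Box a, Box b)
{
	double		w,
				h;

	assert(a.xlo <= a.xhi && a.ylo <= a.yhi);
	assert(b.xlo <= b.xhi && b.ylo <= b.yhi);

	/* extent of the intersection along each axis; negative if disjoint */
	w = min_d(a.xhi, b.xhi) - max_d(a.xlo, b.xlo);
	h = min_d(a.yhi, b.yhi) - max_d(a.ylo, b.ylo);

	if (w <= 0.0 || h <= 0.0)
		return 0.0;
	return w * h;
}
```

A search whose query region falls inside the overlap has to descend into both pages, which is why large overlaps hurt search performance even when the data itself is easy to partition.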

------
With best regards,
Alexander Korotkov.

#118Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#116)
Re: WIP: Fast GiST index build

On 26.08.2011 17:18, Alexander Korotkov wrote:

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a test
server.

1) I've updated links to the datasets on the wiki page.
2) The script for index quality testing, fastbuild_test.php, is in the attachment.
In order to run it you need PHP with the pdo and pdo_pgsql modules. The
plantuner module is also required (it is used to force the planner to use a specific
index). After running that script, the following query returns the relative score of
index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index whose
quality is assumed to be 1.
3) A patch which makes plantuner work with HEAD is also in the attachment.
4) A patch with my split algorithm implementation is attached. For now its form
is appropriate only for testing purposes.
5) For index creation I use a simple script, which is attached as
'indexes.sql'. I also run a similar script with different index names
with my split patch.

Feel free to ask questions about all this stuff.
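
The relative score computed by the query above can be sketched outside the database as follows. This is only an illustration, assuming test_result rows for a single table are available as (indexname, predicate, count) tuples; the sample counts and the index name 'usnoa2_idx1' are made up for the example and are not taken from the thread.

```python
from collections import defaultdict

def relative_scores(rows, baseline_index):
    """Average, per index, of count / baseline count for the same predicate.

    rows: list of (indexname, predicate, count) tuples for one table.
    baseline_index: the index whose quality is taken to be 1.
    """
    baseline = {pred: cnt for idx, pred, cnt in rows if idx == baseline_index}
    ratios = defaultdict(list)
    for idx, pred, cnt in rows:
        ratios[idx].append(cnt / baseline[pred])
    return {idx: sum(r) / len(r) for idx, r in ratios.items()}

# Hypothetical sample data: page-access counts for two predicates.
rows = [
    ("usnoa2_idx3", "p1", 100), ("usnoa2_idx3", "p2", 200),
    ("usnoa2_idx1", "p1", 150), ("usnoa2_idx1", "p2", 300),
]
scores = relative_scores(rows, "usnoa2_idx3")
```

With this made-up data, a score above 1 means the index needed more page accesses than the baseline index for the same predicates.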

Thanks! Meanwhile, I hacked together a script of my own to do
performance testing. I let it run over the weekend, but I just realized
that I forgot to vacuum the test tables so the results are not worth
much. I'm rerunning them now, stay tuned!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#119Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#116)
1 attachment(s)
Re: WIP: Fast GiST index build

On 26.08.2011 17:18, Alexander Korotkov wrote:

On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a test
server.

1) I've updated links to the datasets on the wiki page.
2) The script for index quality testing, fastbuild_test.php, is in the attachment.
In order to run it you need PHP with the pdo and pdo_pgsql modules. The
plantuner module is also required (it is used to force the planner to use a specific
index). After running that script, the following query returns the relative score of
index quality:

select indexname, avg(count::real/(select count from test_result a2 where
a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
'usnoa2' group by indexname;

where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index whose
quality is assumed to be 1.
3) A patch which makes plantuner work with HEAD is also in the attachment.
4) A patch with my split algorithm implementation is attached. For now its form
is appropriate only for testing purposes.
5) For index creation I use a simple script, which is attached as
'indexes.sql'. I also run a similar script with different index names
with my split patch.

Feel free to ask questions about all this stuff.

Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data,
but here are some preliminary results:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered auto | 250000000 | 08:08:39.174956 | 3757848
points unordered buffered | 250000000 | 09:29:16.47012 | 4049832
points unordered unbuffered | 250000000 | 03:48:10.999861 | 4564986

As you can see, the results are very disappointing :-(. The buffered
builds take a lot *longer* than unbuffered ones. I was expecting the
buffering to be very helpful at least in these unordered tests. On the
positive side, the buffering made index quality somewhat better
(accesses column, smaller is better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM,
the table was about 8.5 GB in size (and I think the index is even larger
than that), and the box has 4GB of RAM. Does the buffering only help
with even larger indexes that exceed the cache size even more?

Test methodology
----------------

These tests consist of creating a gist index using the point datatype.

Table "public.points"
Column | Type | Modifiers
--------+---------+-----------
x | integer |
y | integer |

CREATE INDEX testindex ON points_ordered USING gist (point(x,y)) WITH
(buffering = 'on');

The points in the table are uniformly distributed. In the 'unordered'
tests, they are in random order. The ordered tests use the exact same
data, but sorted by x, y coordinates.

The 'accesses' column measures the quality of the produced index.
Smaller is better. It is calculated by performing a million queries on
the table, selecting points within a small square at evenly spaced
locations. Like:

(SELECT COUNT(*) FROM points WHERE point(x,y) <@ box(point(xx-20,
yy-20), point(xx+20, yy+20)));

The number of index pages touched by those queries is counted from
pg_statio_user_indexes, and that number is reported in the 'accesses'
column.
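
As a rough illustration of the probing described above (not the attached script itself), the query boxes could be generated like this. The function name, the grid layout, and the choice of cell centers are assumptions made for the sketch; the extent of 100000 matches the data range mentioned later in the thread.

```python
def probe_queries(extent, per_axis, half_width=20):
    """Yield one small-box COUNT(*) query per cell of an evenly spaced grid.

    extent: side length of the square data area (assumed to start at 0).
    per_axis: number of probe locations along each axis.
    half_width: half the side of the query square (20 in the thread).
    """
    step = extent / per_axis
    for i in range(per_axis):
        for j in range(per_axis):
            # Probe at the center of each grid cell (an assumption).
            xx = int((i + 0.5) * step)
            yy = int((j + 0.5) * step)
            yield (
                "SELECT COUNT(*) FROM points WHERE point(x,y) <@ "
                f"box(point({xx - half_width}, {yy - half_width}), "
                f"point({xx + half_width}, {yy + half_width}));"
            )

# A tiny 4 x 4 grid for demonstration; per_axis=1000 would give a million probes.
queries = list(probe_queries(extent=100000, per_axis=4))
```

The 'accesses' figure would then be the change in the index's block counters in pg_statio_user_indexes between the start and the end of the probe run.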

I've attached the whole script used. Pass the number of rows to use in
the test as argument, and the script does the rest.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

rungisttests.sh (application/x-sh)

#120Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#119)
Re: WIP: Fast GiST index build

On 30.08.2011 12:08, Heikki Linnakangas wrote:

What's going on here? This data set was large enough to not fit in RAM,
the table was about 8.5 GB in size (and I think the index is even larger
than that), and the box has 4GB of RAM. Does the buffering only help
with even larger indexes that exceed the cache size even more?

The tests are still running, so I decided to try oprofile. The build is
in the final emptying phase, according to the log, and has been for over
half an hour now. Oprofile output looks very interesting:

samples % image name symbol name
228590 30.3045 postgres pg_qsort
200558 26.5882 postgres gistBuffersFreeBlocksCmp
49397 6.5486 postgres gistchoose
45563 6.0403 postgres gist_box_penalty
31425 4.1661 postgres AllocSetAlloc
24182 3.2058 postgres FunctionCall3Coll
22671 3.0055 postgres rt_box_union
20147 2.6709 postgres gistpenalty
17007 2.2546 postgres DirectFunctionCall2Coll
15790 2.0933 no-vmlinux /no-vmlinux
14148 1.8756 postgres XLogInsert
10612 1.4068 postgres gistdentryinit
10542 1.3976 postgres MemoryContextAlloc
9466 1.2549 postgres FunctionCall1Coll
9190 1.2183 postgres gist_box_decompress
6681 0.8857 postgres med3
4941 0.6550 libc-2.12.so isnanf

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#121Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#120)
Re: WIP: Fast GiST index build

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

------
With best regards,
Alexander Korotkov.

#122Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#119)
Re: WIP: Fast GiST index build

On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data, but
here are some preliminary results:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered auto | 250000000 | 08:08:39.174956 | 3757848
points unordered buffered | 250000000 | 09:29:16.47012 | 4049832
points unordered unbuffered | 250000000 | 03:48:10.999861 | 4564986

As you can see, the results are very disappointing :-(. The buffered builds
take a lot *longer* than unbuffered ones. I was expecting the buffering to
be very helpful at least in these unordered tests. On the positive side, the
buffering made index quality somewhat better (accesses column, smaller is
better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM, the
table was about 8.5 GB in size (and I think the index is even larger than
that), and the box has 4GB of RAM. Does the buffering only help with even
larger indexes that exceed the cache size even more?

This seems pretty strange to me. The time of the unbuffered index build shows that
there is no bottleneck at I/O. That differs radically from my
experiments. I'm going to try your test script on my test setup.
For now my only guess is that the random function is
somewhat bad, so the unordered dataset behaves like the ordered one. Can you
rerun the tests on your test setup with dataset generation on the backend, like
this?
CREATE TABLE points AS (SELECT point(random(), random()) FROM
generate_series(1,10000000));

------
With best regards,
Alexander Korotkov.

#123Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#121)
Re: WIP: Fast GiST index build

On 30.08.2011 13:29, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

I'm re-running the tests with that change now. It seems like using the
list of free blocks as a simple stack would be better anyway, it
probably yields a better cache hit ratio when we re-use blocks that have
just been released.
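
To make the idea concrete, here is a minimal sketch of a free-block list used as a plain stack, with no sorting on allocation. The class and method names are hypothetical and are not the patch's actual C code.

```python
class FreeBlockStack:
    """Hands out temporary-file block numbers, reusing freed ones LIFO."""

    def __init__(self):
        self._free = []       # released block numbers, most recent last
        self._next_new = 0    # next never-used block number

    def release(self, blkno):
        # O(1): push the block; no ordering is maintained.
        self._free.append(blkno)

    def acquire(self):
        # O(1): prefer the most recently released block, which is the
        # one most likely to still be in cache.
        if self._free:
            return self._free.pop()
        blkno = self._next_new
        self._next_new += 1
        return blkno

fb = FreeBlockStack()
a, b, c = fb.acquire(), fb.acquire(), fb.acquire()
fb.release(a)
fb.release(c)
reused = fb.acquire()  # the block released last (c) comes back first
```

Compared with picking the lowest free block after sorting the list, both operations here are constant time, at the cost of a less compact file layout.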

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#124Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#122)
Re: WIP: Fast GiST index build

On 30.08.2011 13:38, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data, but
here are some preliminary results:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered auto | 250000000 | 08:08:39.174956 | 3757848
points unordered buffered | 250000000 | 09:29:16.47012 | 4049832
points unordered unbuffered | 250000000 | 03:48:10.999861 | 4564986

As you can see, the results are very disappointing :-(. The buffered builds
take a lot *longer* than unbuffered ones. I was expecting the buffering to
be very helpful at least in these unordered tests. On the positive side, the
buffering made index quality somewhat better (accesses column, smaller is
better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM, the
table was about 8.5 GB in size (and I think the index is even larger than
that), and the box has 4GB of RAM. Does the buffering only help with even
larger indexes that exceed the cache size even more?

This seems pretty strange to me. The time of the unbuffered index build shows that
there is no bottleneck at I/O. That differs radically from my
experiments. I'm going to try your test script on my test setup.
For now my only guess is that the random function is
somewhat bad, so the unordered dataset behaves like the ordered one.

Oh. Doing a simple "SELECT * FROM points LIMIT 10", it looks pretty
random to me. The data should be uniformly distributed in a rectangle
from (0, 0) to (100000, 100000).

Can you
rerun the tests on your test setup with dataset generation on the backend, like
this?
CREATE TABLE points AS (SELECT point(random(), random()) FROM
generate_series(1,10000000));

Ok, I'll queue up that test after the ones I'm running now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#125Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#121)
Re: WIP: Fast GiST index build

On 30.08.2011 13:29, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

Ok, the first results are in for that:

testname | nrows | duration | accesses
---------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 06:00:23.707579 | 4049832

From the previous test runs, the unbuffered index build took under 4
hours, so even though this is a lot better than with the sorting, it's
still a loss compared to non-buffered build.

I had vmstat running during most of this index build. At a quick glance,
it doesn't seem to be CPU bound anymore. I suspect the buffers in the
temporary file get very fragmented. Or, we're reading it in backwards
order because the buffers work in a LIFO fashion. The system seems to be
doing about 5 MB/s of I/O during the build, which sounds like a figure
you'd get for more or less random I/O.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#126Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#125)
Re: WIP: Fast GiST index build

On Tue, Aug 30, 2011 at 9:29 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 30.08.2011 13:29, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

Ok, the first results are in for that:

testname | nrows | duration | accesses
---------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 06:00:23.707579 | 4049832

From the previous test runs, the unbuffered index build took under 4 hours,
so even though this is a lot better than with the sorting, it's still a loss
compared to non-buffered build.

I had vmstat running during most of this index build. At a quick glance, it
doesn't seem to be CPU bound anymore. I suspect the buffers in the temporary
file gets very fragmented. Or, we're reading it in backwards order because
the buffers work in a LIFO fashion. The system seems to be doing about 5
MB/s of I/O during the build, which sounds like a figure you'd get for more
or less random I/O.

So, we still have two questions:
1) Why doesn't the buffering build algorithm show any benefit in these tests?
2) Why do the test results on your test setup differ from the test results on my
test setup?

I can propose the following answers now:
1) I think it's because of high overlaps in the tree. As I mentioned before,
high overlaps can cause only a fraction of the tree to be used for actual
inserts. For comparison, with my split algorithm (which produces almost no
overlaps on the uniform dataset) the buffering index build took 4 hours, while
the regular build is still running (already more than 8 days = 192 hours)!
2) Probably it's because of different behaviour of the OS cache. For example, on my
test setup the OS displaces unused pages from the cache too slowly. Thereby the
buffering algorithm showed a benefit nevertheless.

Also, it seems to me that I am starting to understand the problem with the new linear
splitting algorithm. On a dataset with 1M rows it produces almost no overlaps,
while it already produces significant overlaps on 10M rows (drama!).
Probably nobody tested it on large enough datasets (neither during the original
research nor before commit). I'll dig into it in more detail and provide some
testing results.

------
With best regards,
Alexander Korotkov.

#127Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#122)
Re: WIP: Fast GiST index build

On 30.08.2011 13:38, Alexander Korotkov wrote:

On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data, but
here are some preliminary results:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered auto | 250000000 | 08:08:39.174956 | 3757848
points unordered buffered | 250000000 | 09:29:16.47012 | 4049832
points unordered unbuffered | 250000000 | 03:48:10.999861 | 4564986

As you can see, the results are very disappointing :-(. The buffered builds
take a lot *longer* than unbuffered ones. I was expecting the buffering to
be very helpful at least in these unordered tests. On the positive side, the
buffering made index quality somewhat better (accesses column, smaller is
better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM, the
table was about 8.5 GB in size (and I think the index is even larger than
that), and the box has 4GB of RAM. Does the buffering only help with even
larger indexes that exceed the cache size even more?

This seems pretty strange to me. The time of the unbuffered index build shows that
there is no bottleneck at I/O. That differs radically from my
experiments. I'm going to try your test script on my test setup.
For now my only guess is that the random function is
somewhat bad, so the unordered dataset behaves like the ordered one. Can you
rerun the tests on your test setup with dataset generation on the backend, like
this?
CREATE TABLE points AS (SELECT point(random(), random()) FROM
generate_series(1,10000000));

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
points unordered auto | 250000000 | 05:34:12.187479 | 2246420
points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#128Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#127)
Re: WIP: Fast GiST index build

On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
points unordered auto | 250000000 | 05:34:12.187479 | 2246420
points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?

But it's still a lot of overlap. It's about 220 accesses per small-area
request, which is about 10 - 20 times more than there should be without overlaps.
If we roughly assume that 10 times more overlap means that 1/10 of the tree is
used for actual inserts, then that part of the tree can easily fit in the cache.
You can try my splitting algorithm on your test setup (in this case I advise
starting with a smaller number of rows, 100 M for example).
I'm requesting real-life datasets which cause trouble in real life from
Oleg. Probably those datasets are even larger, or the new linear split produces
less overlap on them.

------
With best regards,
Alexander Korotkov.

#129Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#128)
Re: WIP: Fast GiST index build

On 01.09.2011 12:23, Alexander Korotkov wrote:

On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
points unordered auto | 250000000 | 05:34:12.187479 | 2246420
points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?

But it's still a lot of overlap. It's about 220 accesses per small area
request. It's about 10 - 20 times greater than should be without overlaps.

Hmm, those "accesses" numbers are actually quite bogus for this test. I
changed the creation of the table as you suggested, so that all x and y
values are in the range 0.0 - 1.0, but I didn't change the loop to
calculate those accesses, so it still queried for boxes in the range 0 -
100000. That makes me wonder, why does it need 220 accesses on average
to satisfy queries most of which lie completely outside the range of
actual values in the index? I would expect such queries to just look at
the root node, conclude that there can't be any matching tuples, and
return immediately.
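
The expectation can be illustrated with a plain bounding-box overlap test. This is a simplified sketch: it assumes the keys on the root page together cover just the unit square, and the probe coordinates are example values in the 0 - 100000 range, not figures from the test run.

```python
def boxes_overlap(a, b):
    """Each box is (xmin, ymin, xmax, ymax); touching edges count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Data generated with random() lies in the unit square.
data_extent = (0.0, 0.0, 1.0, 1.0)

# A probe box far away from the data, as most of the queries were.
far_probe = (49980, 49980, 50020, 50020)

# A probe box near the origin, which does intersect the data.
near_origin = (-20, -20, 20, 20)

far_hits = boxes_overlap(data_extent, far_probe)
near_hits = boxes_overlap(data_extent, near_origin)
```

If the root's keys were this tight, only probes like the second one should need to descend below the root.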

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#130Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#128)
1 attachment(s)
Re: WIP: Fast GiST index build

On 01.09.2011 12:23, Alexander Korotkov wrote:

On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
points unordered auto | 250000000 | 05:34:12.187479 | 2246420
points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?

But it's still a lot of overlap. It's about 220 accesses per small area
request. It's about 10 - 20 times greater than should be without overlaps.
If we roughly assume that 10 times more overlap means that 1/10 of the tree is
used for actual inserts, then that part of the tree can easily fit in the cache.
You can try my splitting algorithm on your test setup (in this case I advise
starting with a smaller number of rows, 100 M for example).
I'm requesting real-life datasets which cause trouble in real life from
Oleg. Probably those datasets are even larger, or the new linear split produces
less overlap on them.

I made a small tweak to the patch, and got much better results (this is
with my original method of generating the data):

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 03:34:23.488275 | 3945486
points unordered auto | 250000000 | 02:55:10.248722 | 3767548
points unordered unbuffered | 250000000 | 04:02:04.168138 | 4564986

The tweak I made was to the way buffers are emptied in the final
emptying phase. Previously, it repeatedly looped through all the buffers
at a level, until there were no more non-empty buffers at the level.
When a buffer was split while it was being emptied, processing that
buffer stopped, and the emptying process moved on to the next buffer. I
changed it so that when a buffer splits, we continue emptying that
buffer until it's completely empty. That behavior is much more
cache-friendly, which shows as much better overall performance.
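
The difference between the two emptying orders can be shown with a toy model. This is not the patch's code: buffers are plain queues, and a "split" is simulated as happening after every split_every tuples, which is purely an assumption for illustration.

```python
from collections import deque

def empty_level(buffers, split_every, stop_on_split):
    """Empty all buffers at one level; return the sequence of buffer visits.

    buffers: dict mapping buffer name to a deque of tuples.
    split_every: a split is simulated after this many tuples (toy assumption).
    stop_on_split: True models the old behavior (move on after a split),
                   False models the tweak (keep going until empty).
    """
    visits = []
    while any(buffers.values()):
        for name, buf in buffers.items():
            if not buf:
                continue
            visits.append(name)
            popped = 0
            while buf:
                buf.popleft()  # tuple would be pushed down the tree here
                popped += 1
                if stop_on_split and popped % split_every == 0:
                    break
    return visits

def make_buffers():
    return {"A": deque(range(6)), "B": deque(range(6))}

old_order = empty_level(make_buffers(), split_every=2, stop_on_split=True)
new_order = empty_level(make_buffers(), split_every=2, stop_on_split=False)
```

In this model the old order revisits each buffer three times, while the new order touches each buffer once, which is the kind of locality that would help caching.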

I probably changed that behavior for the worse in my previous rounds of
cleanup. Anyway, attached is the patch I used to get the above numbers.
Now that the performance problem is fixed, I'll continue reviewing and
cleaning up the patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist_fast_build-0.14.2-heikki-2.patch (text/x-diff)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 78171cf..3120bf9 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -642,6 +642,38 @@ my_distance(PG_FUNCTION_ARGS)
 
   </variablelist>
 
+ <sect2 id="gist-buffering-build">
+  <title>GiST buffering build</title>
+  <para>
+   Building large GiST indexes by simply inserting all the tuples tends to be
+   slow, because if the index tuples are scattered across the index and the
+   index is large enough to not fit in cache, the insertions need to perform
+   a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
+   method to build GiST indexes based on buffering, which can dramatically
+   reduce number of random I/O needed for non-ordered data sets. For
+   well-ordered datasets the benefit is smaller or non-existent, because
+   only a small number of pages receive new tuples at a time, and those pages
+   fit in cache even if the index as whole does not.
+  </para>
+
+  <para>
+   However, buffering index build needs to call the <function>penalty</>
+   function more often, which consumes some extra CPU resources. Also, it can
+   influence the quality of the produced index, in both positive and negative
+   directions. That influence depends on various factors, like the
+   distribution of the input data and operator class implementation.
+  </para>
+
+  <para>
+   By default, the index build switches to the buffering method when the
+   index size reaches <xref linkend="guc-effective-cache-size">. It can
+   be manually turned on or off by the <literal>BUFFERING</literal> parameter
+   to the CREATE INDEX clause. The default behavior is good for most cases,
+   but turning buffering off might speed up the build somewhat if the input
+   data is ordered.
+  </para>
+
+ </sect2>
 </sect1>
 
 <sect1 id="gist-examples">
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 1a1e8d6..2cfc9f3 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -341,6 +341,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
    </varlistentry>
 
    </variablelist>
+   <para>
+    GiST indexes additionally accept parameters:
+   </para>
+
+   <variablelist>
+
+   <varlistentry>
+    <term><literal>BUFFERING</></term>
+    <listitem>
+    <para>
+     Determines whether the buffering build technique described in
+     <xref linkend="gist-buffering-build"> is used to build the index. With
+     <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+     with <literal>AUTO</> it is initially disabled, but turned on
+     on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">. The default is <literal>AUTO</>.
+    </para>
+    </listitem>
+   </varlistentry>
+
+   </variablelist>
   </refsect2>
 
   <refsect2 id="SQL-CREATEINDEX-CONCURRENTLY">
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 900b222..240e178 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -219,6 +219,17 @@ static relopt_real realRelOpts[] =
 
 static relopt_string stringRelOpts[] =
 {
+	{
+		{
+			"buffering",
+			"Enables buffering build for this GiST index",
+			RELOPT_KIND_GIST
+		},
+		4,
+		false,
+		gistValidateBufferingOption,
+		"auto"
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/gist/Makefile b/src/backend/access/gist/Makefile
index f8051a2..cc9468f 100644
--- a/src/backend/access/gist/Makefile
+++ b/src/backend/access/gist/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-       gistproc.o gistsplit.o
+       gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 2d78dcb..533be22 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -24,6 +24,7 @@ The current implementation of GiST supports:
   * provides NULL-safe interface to GiST core
   * Concurrency
   * Recovery support via WAL logging
+  * Buffering build algorithm
 
 The support for concurrency implemented in PostgreSQL was developed based on
 the paper "Access Methods for Next-Generation Database Systems" by
@@ -31,6 +32,12 @@ Marcel Kornaker:
 
     http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
 
+Buffering build algorithm for GiST was developed based on the paper "Efficient
+Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
+and Jeffrey Scott Vitter.
+
+    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+
 The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
 with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
 crashed in the middle to completion by adding the downlink in the parent.
 
+Buffering build algorithm
+-------------------------
+
+In the buffering index build algorithm, some or all internal nodes have a
+buffer attached to them. When a tuple is inserted at the top, the descend down
+the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
+the buffer. When a buffer gets too full, all the tuples in it are flushed to
+the lower level, where they again hit lower level buffers or leaf pages. This
+makes the insertions happen in more of a breadth-first than depth-first order,
+which greatly reduces the amount of random I/O required.
+
+In the algorithm, levels are numbered so that leaf pages have level zero,
+and internal node levels count up from 1. This numbering ensures that a page's
+level number never changes, even when the root page is split.
+
+Level                    Tree
+
+3                         *
+                      /       \
+2                *                 *
+              /  |  \           /  |  \
+1          *     *     *     *     *     *
+          / \   / \   / \   / \   / \   / \
+0        o   o o   o o   o o   o o   o o   o
+
+* - internal page
+o - leaf page
+
+Internal pages that belong to certain levels have buffers associated with
+them. Leaf pages never have buffers. Which levels have buffers is controlled
+by "level step" parameter: level numbers that are multiples of level_step
+have buffers, while others do not. For example, if level_step = 2, then
+pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
+internal page has a buffer.
+
+Level        Tree (level_step = 1)                Tree (level_step = 2)
+
+3                      *(b)                                  *
+                   /       \                             /       \
+2             *(b)              *(b)                *(b)              *(b)
+           /  |  \           /  |  \             /  |  \           /  |  \
+1       *(b)  *(b)  *(b)  *(b)  *(b)  *(b)    *     *     *     *     *     *
+       / \   / \   / \   / \   / \   / \     / \   / \   / \   / \   / \   / \
+0     o   o o   o o   o o   o o   o o   o   o   o o   o o   o o   o o   o o   o
+
+(b) - buffer
+
+Logically, a buffer is just a bunch of tuples. Physically, it is divided into
+pages, backed by a temporary file. Each buffer can be in one of two states:
+a) Last page of the buffer is kept in main memory. A node buffer is
+automatically switched to this state when a new index tuple is added to it,
+or a tuple is removed from it.
+b) All pages of the buffer are swapped out to disk. When a buffer becomes too
+full, and we start to flush it, all other buffers are switched to this state.
+
+When an index tuple is inserted, its initial processing can end in one of the
+following points:
+1) Leaf page, if the depth of the index <= level_step, meaning that
+   none of the internal pages have buffers associated with them.
+2) Buffer of topmost level page that has buffers.
+
+New index tuples are processed until one of the buffers in the topmost
+buffered level becomes half-full. When a buffer becomes half-full, it's added
+to the emptying queue, and will be emptied before a new tuple is processed.
+
+The buffer emptying process moves index tuples from the buffer into buffers
+at a lower level, or into leaf pages. First, all the other buffers are
+swapped to disk to free up the memory. Then tuples are popped from the buffer
+one by one, and cascaded down the tree to the next buffer or leaf page below
+the buffered node.
+
+Emptying a buffer has the interesting dynamic property that any intermediate
+pages between the buffer being emptied, and the next buffered or leaf level
+below it, become cached. If there are no more buffers below the node, the leaf
+pages where the tuples finally land get cached too. If there are, the last
+buffer page of each buffer below is kept in memory. This is illustrated in
+the figures below:
+
+   Buffer being emptied to
+     lower-level buffers               Buffer being emptied to leaf pages
+
+               +(fb)                                 +(fb)
+            /     \                                /     \
+        +             +                        +             +
+      /   \         /   \                    /   \         /   \
+    *(ab)   *(ab) *(ab)   *(ab)            x       x     x       x
+
++    - cached internal page
+x    - cached leaf page
+*    - non-cached internal page
+(fb) - buffer being emptied
+(ab) - buffers being appended to, with last page in memory
+
+At the beginning of the index build, the level-step is chosen so that all those
+pages involved in emptying one buffer fit in cache, so after each of those
+pages has been accessed once and cached, emptying a buffer doesn't involve
+any more I/O. This locality is where the speedup of the buffering algorithm
+comes from.
+
+Emptying one buffer can fill up one or more of the lower-level buffers,
+triggering emptying of them as well. Whenever a buffer becomes too full, it's
+added to the emptying queue, and will be emptied after the current buffer has
+been processed.
+
+To keep the size of each buffer limited even in the worst case, buffer emptying
+is scheduled as soon as a buffer becomes half-full, and emptying it continues
+until 1/2 of the nominal buffer size worth of tuples has been emptied. This
+guarantees that when buffer emptying begins, all the lower-level buffers
+are at most half-full. In the worst case that all the tuples are cascaded down
+to the same lower-level buffer, that buffer therefore has enough space to
+accommodate all the tuples emptied from the upper-level buffer. There is no
+hard size limit in any of the data structures used, though, so this only needs
+to be approximate; small overfilling of some buffers doesn't matter.
+
+If an internal page that has a buffer associated with it is split, the buffer
+needs to be split too. All tuples in the buffer are scanned through and
+relocated to the correct sibling buffers, using the penalty function to decide
+which buffer each tuple should go to.
+
+After all tuples from the heap have been processed, there are still some index
+tuples in the buffers. At this point, final buffer emptying starts. All buffers
+are emptied in top-down order. This is slightly complicated by the fact that
+new buffers can be allocated during the emptying, due to page splits. However,
+the new buffers will always be siblings of buffers that haven't been fully
+emptied yet; tuples never move upwards in the tree. The final emptying loops
+through buffers at a given level until all buffers at that level have been
+emptied, and then moves down to the next level.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
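
As a rough illustration of the level numbering and the level_step rule
described in the README text above, here is a small standalone C sketch. The
helper names level_has_buffers and first_stop_level are hypothetical and are
not part of the patch. The sketch follows the README wording (every internal
level that is a multiple of level_step is buffered, including the root level)
and may differ in detail from the LEVEL_HAS_BUFFERS macro the patch uses.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not in the patch): does a page on 'level' have a
 * buffer? Leaf pages are level 0 and never have buffers; internal levels
 * that are multiples of level_step do.
 */
static bool
level_has_buffers(int level, int level_step)
{
	return level > 0 && level % level_step == 0;
}

/*
 * Hypothetical helper: the level at which a tuple inserted at the root
 * first stops. Walk down from the root until we hit a buffered level or
 * the leaf level (0).
 */
static int
first_stop_level(int root_level, int level_step)
{
	int			level = root_level;

	while (level > 0 && !level_has_buffers(level, level_step))
		level--;
	return level;
}
```

With the tree from the README figure (root at level 3) and level_step = 2, a
new tuple would first stop in a level-2 buffer. With a root at level 1 and
level_step = 2 there are no buffered levels, so it would go straight to a
leaf page.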
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4fc7a21..2fa2bf3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -24,33 +24,7 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-/* Working state for gistbuild and its callback */
-typedef struct
-{
-	GISTSTATE	giststate;
-	int			numindexattrs;
-	double		indtuples;
-	MemoryContext tmpCtx;
-} GISTBuildState;
-
-/* A List of these is used represent a split-in-progress. */
-typedef struct
-{
-	Buffer		buf;			/* the split page "half" */
-	IndexTuple	downlink;		/* downlink for this half. */
-} GISTPageSplitInfo;
-
 /* non-export function prototypes */
-static void gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state);
-static void gistdoinsert(Relation r,
-			 IndexTuple itup,
-			 Size freespace,
-			 GISTSTATE *GISTstate);
 static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 GISTSTATE *giststate,
@@ -89,138 +63,6 @@ createTempGistContext(void)
 }
 
 /*
- * Routine to build an index.  Basically calls insert over and over.
- *
- * XXX: it would be nice to implement some sort of bulk-loading
- * algorithm, but it is not clear how to do that.
- */
-Datum
-gistbuild(PG_FUNCTION_ARGS)
-{
-	Relation	heap = (Relation) PG_GETARG_POINTER(0);
-	Relation	index = (Relation) PG_GETARG_POINTER(1);
-	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
-	IndexBuildResult *result;
-	double		reltuples;
-	GISTBuildState buildstate;
-	Buffer		buffer;
-	Page		page;
-
-	/*
-	 * We expect to be called exactly once for any index relation. If that's
-	 * not the case, big trouble's what we have.
-	 */
-	if (RelationGetNumberOfBlocks(index) != 0)
-		elog(ERROR, "index \"%s\" already contains data",
-			 RelationGetRelationName(index));
-
-	/* no locking is needed */
-	initGISTstate(&buildstate.giststate, index);
-
-	/* initialize the root page */
-	buffer = gistNewBuffer(index);
-	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
-	page = BufferGetPage(buffer);
-
-	START_CRIT_SECTION();
-
-	GISTInitBuffer(buffer, F_LEAF);
-
-	MarkBufferDirty(buffer);
-
-	if (RelationNeedsWAL(index))
-	{
-		XLogRecPtr	recptr;
-		XLogRecData rdata;
-
-		rdata.data = (char *) &(index->rd_node);
-		rdata.len = sizeof(RelFileNode);
-		rdata.buffer = InvalidBuffer;
-		rdata.next = NULL;
-
-		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
-		PageSetLSN(page, recptr);
-		PageSetTLI(page, ThisTimeLineID);
-	}
-	else
-		PageSetLSN(page, GetXLogRecPtrForTemp());
-
-	UnlockReleaseBuffer(buffer);
-
-	END_CRIT_SECTION();
-
-	/* build the index */
-	buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
-	buildstate.indtuples = 0;
-
-	/*
-	 * create a temporary memory context that is reset once for each tuple
-	 * inserted into the index
-	 */
-	buildstate.tmpCtx = createTempGistContext();
-
-	/* do the heap scan */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
-								   gistbuildCallback, (void *) &buildstate);
-
-	/* okay, all heap tuples are indexed */
-	MemoryContextDelete(buildstate.tmpCtx);
-
-	freeGISTstate(&buildstate.giststate);
-
-	/*
-	 * Return statistics
-	 */
-	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
-
-	result->heap_tuples = reltuples;
-	result->index_tuples = buildstate.indtuples;
-
-	PG_RETURN_POINTER(result);
-}
-
-/*
- * Per-tuple callback from IndexBuildHeapScan
- */
-static void
-gistbuildCallback(Relation index,
-				  HeapTuple htup,
-				  Datum *values,
-				  bool *isnull,
-				  bool tupleIsAlive,
-				  void *state)
-{
-	GISTBuildState *buildstate = (GISTBuildState *) state;
-	IndexTuple	itup;
-	MemoryContext oldCtx;
-
-	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
-
-	/* form an index tuple and point it at the heap tuple */
-	itup = gistFormTuple(&buildstate->giststate, index,
-						 values, isnull, true /* size is currently bogus */ );
-	itup->t_tid = htup->t_self;
-
-	/*
-	 * Since we already have the index relation locked, we call gistdoinsert
-	 * directly.  Normal access method calls dispatch through gistinsert,
-	 * which locks the relation for write.	This is the right thing to do if
-	 * you're inserting single tups, but not when you're initializing the
-	 * whole index at once.
-	 *
-	 * In this path we respect the fillfactor setting, whereas insertions
-	 * after initial build do not.
-	 */
-	gistdoinsert(index, itup,
-			  RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
-				 &buildstate->giststate);
-
-	buildstate->indtuples += 1;
-	MemoryContextSwitchTo(oldCtx);
-	MemoryContextReset(buildstate->tmpCtx);
-}
-
-/*
  *	gistbuildempty() -- build an empty gist index in the initialization fork
  */
 Datum
@@ -293,8 +135,10 @@ gistinsert(PG_FUNCTION_ARGS)
  * In that case, we continue to hold the root page locked, and the child
  * pages are released; note that new tuple(s) are *not* on the root page
  * but in one of the new child pages.
+ *
+ * Returns 'true' if the page was split, 'false' otherwise.
  */
-static bool
+bool
 gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 				Buffer buffer,
 				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
@@ -474,7 +318,15 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			else
 				GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
 
-			if (ptr->next && !is_rootsplit)
+			/*
+			 * Mark all but the right-most page with the follow-right
+			 * flag. It will be cleared as soon as the downlink is inserted
+			 * into the parent, but this ensures that if we error out before
+			 * that, the index is still consistent. (in buffering build mode,
+			 * any error will abort the index build anyway, so this is not
+			 * needed.)
+			 */
+			if (ptr->next && !is_rootsplit && !giststate->gfbb)
 				GistMarkFollowRight(ptr->page);
 			else
 				GistClearFollowRight(ptr->page);
@@ -508,7 +360,8 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 		/* Write the WAL record */
 		if (RelationNeedsWAL(state->r))
 			recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
-								   dist, oldrlink, oldnsn, leftchildbuf);
+								   dist, oldrlink, oldnsn, leftchildbuf,
+								   giststate->gfbb ? true : false);
 		else
 			recptr = GetXLogRecPtrForTemp();
 
@@ -570,8 +423,6 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
 			recptr = GetXLogRecPtrForTemp();
 			PageSetLSN(page, recptr);
 		}
-
-		*splitinfo = NIL;
 	}
 
 	/*
@@ -608,7 +459,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
  * this routine assumes it is invoked in a short-lived memory context,
  * so it does not bother releasing palloc'd allocations.
  */
-static void
+void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 {
 	ItemId		iid;
@@ -1414,6 +1265,7 @@ initGISTstate(GISTSTATE *giststate, Relation index)
 		else
 			giststate->supportCollation[i] = DEFAULT_COLLATION_OID;
 	}
+	giststate->gfbb = NULL;
 }
 
 void
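
The new gistbuild.c that follows keeps descent paths as reference-counted
stack items (see gistFreeUnreferencedPath and gistDecreasePathRefcount). A
minimal standalone sketch of that freeing rule is given here for readers who
want to experiment with it. It assumes a hypothetical PathItem struct and
plain malloc/free in place of GISTBufferingInsertStack and PostgreSQL memory
contexts; freed_items exists only so the behaviour can be observed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for GISTBufferingInsertStack (only two fields). */
typedef struct PathItem
{
	struct PathItem *parent;
	int			refCount;		/* number of children/buffers pointing here */
} PathItem;

static int	freed_items = 0;	/* demonstration counter, not in the patch */

/* Create a new, initially unreferenced item below 'parent'. */
static PathItem *
path_push(PathItem *parent)
{
	PathItem   *item = malloc(sizeof(PathItem));

	if (item == NULL)
		abort();
	item->parent = parent;
	item->refCount = 0;
	if (parent)
		parent->refCount++;
	return item;
}

/* Free 'path' and every ancestor that becomes unreferenced as a result. */
static void
path_free_unreferenced(PathItem *path)
{
	while (path != NULL && path->refCount == 0)
	{
		PathItem   *parent = path->parent;

		free(path);
		freed_items++;
		if (parent)
			parent->refCount--;
		path = parent;
	}
}

/* Drop one reference, then free whatever became unreferenced. */
static void
path_release(PathItem *path)
{
	path->refCount--;
	path_free_unreferenced(path);
}
```

The point of the scheme is that several node buffers can share the upper part
of a path: an item is freed only when the last child or buffer referring to
it goes away, and the free then cascades upwards as far as it can.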
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
new file mode 100644
index 0000000..7cdbf11
--- /dev/null
+++ b/src/backend/access/gist/gistbuild.c
@@ -0,0 +1,1046 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuild.c
+ *	  implementation of the build algorithm for GiST indexes.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuild.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/* Number of index tuples between checks for switching to buffering mode */
+#define BUFFERING_MODE_SWITCH_CHECK_STEP 256
+
+/*
+ * Number of tuples to process in the slow way before switching to buffering
+ * mode, when buffering is explicitly turned on. Also, the number of tuples
+ * to process between readjusting the buffer size parameter, while in
+ * buffering mode.
+ */
+#define BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET 4096
+
+typedef enum
+{
+	GIST_BUFFERING_DISABLED,	/* in regular build mode and aren't going to
+								 * switch */
+	GIST_BUFFERING_AUTO,		/* in regular build mode, but will switch to
+								 * buffering build mode if the index grows
+								 * too big */
+	GIST_BUFFERING_STATS,		/* gathering statistics of index tuple size
+								 * before switching to the buffering build
+								 * mode */
+	GIST_BUFFERING_ACTIVE		/* in buffering build mode */
+} GistBufferingMode;
+
+/* Working state for gistbuild and its callback */
+typedef struct
+{
+	GISTSTATE	giststate;
+	int64		indtuples;
+	int64		indtuplesSize;
+
+	Size		freespace;	/* Amount of free space to leave on pages */
+
+	GistBufferingMode bufferingMode;
+	MemoryContext tmpCtx;
+} GISTBuildState;
+
+static void gistFreeUnreferencedPath(GISTBufferingInsertStack *path);
+static bool gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+				GISTBuildBuffers *gfbb, IndexTuple itup,
+				GISTBufferingInsertStack *startparent);
+static void gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state);
+static void gistBufferingBuildInsert(Relation index, IndexTuple itup,
+						 GISTBuildState *buildstate);
+static void gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state);
+static int	gistGetMaxLevel(Relation index);
+static bool gistInitBuffering(GISTBuildState *buildstate, Relation index);
+static int	calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+						int levelStep);
+static void gistbufferinginserttuples(GISTInsertState *state, GISTSTATE *giststate,
+				Buffer buffer,
+				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+				GISTBufferingInsertStack *path);
+static void gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r,
+							   GISTBufferingInsertStack *child);
+
+/*
+ * Main entry point to GiST index build. Initially calls insert over and over,
+ * but switches to more efficient buffering build algorithm after a certain
+ * number of tuples (unless buffering mode is disabled).
+ */
+Datum
+gistbuild(PG_FUNCTION_ARGS)
+{
+	Relation	heap = (Relation) PG_GETARG_POINTER(0);
+	Relation	index = (Relation) PG_GETARG_POINTER(1);
+	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+	IndexBuildResult *result;
+	double		reltuples;
+	GISTBuildState buildstate;
+	Buffer		buffer;
+	Page		page;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	buildstate.freespace = RelationGetTargetPageFreeSpace(index,
+													  GIST_DEFAULT_FILLFACTOR);
+
+	if (index->rd_options)
+	{
+		/* Get buffering mode from the options string */
+		GiSTOptions *options = (GiSTOptions *) index->rd_options;
+		char	   *bufferingMode = (char *) options + options->bufferingModeOffset;
+
+		if (strcmp(bufferingMode, "on") == 0)
+			buildstate.bufferingMode = GIST_BUFFERING_STATS;
+		else if (strcmp(bufferingMode, "off") == 0)
+			buildstate.bufferingMode = GIST_BUFFERING_DISABLED;
+		else
+			buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+	}
+	else
+	{
+		/* Automatic buffering mode switching by default */
+		buildstate.bufferingMode = GIST_BUFFERING_AUTO;
+	}
+
+	/*
+	 * We expect to be called exactly once for any index relation. If that's
+	 * not the case, big trouble's what we have.
+	 */
+	if (RelationGetNumberOfBlocks(index) != 0)
+		elog(ERROR, "index \"%s\" already contains data",
+			 RelationGetRelationName(index));
+
+	/* no locking is needed */
+	initGISTstate(&buildstate.giststate, index);
+
+	/* initialize the root page */
+	buffer = gistNewBuffer(index);
+	Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
+	page = BufferGetPage(buffer);
+
+	START_CRIT_SECTION();
+
+	GISTInitBuffer(buffer, F_LEAF);
+
+	MarkBufferDirty(buffer);
+
+	if (RelationNeedsWAL(index))
+	{
+		XLogRecPtr	recptr;
+		XLogRecData rdata;
+
+		rdata.data = (char *) &(index->rd_node);
+		rdata.len = sizeof(RelFileNode);
+		rdata.buffer = InvalidBuffer;
+		rdata.next = NULL;
+
+		recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
+		PageSetLSN(page, recptr);
+		PageSetTLI(page, ThisTimeLineID);
+	}
+	else
+		PageSetLSN(page, GetXLogRecPtrForTemp());
+
+	UnlockReleaseBuffer(buffer);
+
+	END_CRIT_SECTION();
+
+	/* build the index */
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/*
+	 * create a temporary memory context that is reset once for each tuple
+	 * processed.
+	 */
+	buildstate.tmpCtx = createTempGistContext();
+
+	/*
+	 * Do the heap scan.
+	 */
+	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
+								   gistBuildCallback, (void *) &buildstate);
+
+	/*
+	 * If buffering build was used, flush out all the tuples that are still
+	 * in the buffers.
+	 */
+	if (buildstate.bufferingMode == GIST_BUFFERING_ACTIVE)
+	{
+		int			i;
+		GISTInsertState insertstate;
+		GISTNodeBuffer *nodeBuffer;
+		MemoryContext oldCtx;
+		GISTBuildBuffers *gfbb = buildstate.giststate.gfbb;
+
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+
+		memset(&insertstate, 0, sizeof(GISTInsertState));
+		insertstate.freespace = buildstate.freespace;
+		insertstate.r = index;
+
+		/*
+		 * Iterate through the levels, starting from the topmost.
+		 */
+		for (i = gfbb->buffersOnLevelsLen - 1; i >= 0; i--)
+		{
+			/*
+			 * Empty all buffers on this level. We repeatedly loop through all
+			 * the buffers on this level, until we observe that all the
+			 * buffers are empty. Looping through the list once is not enough,
+			 * because emptying one buffer can cause pages to split and new
+			 * buffers to be created on the same (and lower) level.
+			 *
+			 * We remove a buffer from the list when we see that it's empty. A
+			 * buffer can't become non-empty once it's been fully emptied.
+			 */
+			while (gfbb->buffersOnLevels[i] != NIL)
+			{
+				nodeBuffer = (GISTNodeBuffer *) linitial(gfbb->buffersOnLevels[i]);
+
+				if (nodeBuffer->blocksCount != 0)
+				{
+					/* Process emptying of node buffer */
+					MemoryContextSwitchTo(gfbb->context);
+					gfbb->bufferEmptyingQueue = lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
+					MemoryContextSwitchTo(buildstate.tmpCtx);
+					gistProcessEmptyingStack(&buildstate.giststate, &insertstate);
+				}
+				else
+					gfbb->buffersOnLevels[i] = list_delete_first(gfbb->buffersOnLevels[i]);
+			}
+		}
+		MemoryContextSwitchTo(oldCtx);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(buildstate.tmpCtx);
+
+	freeGISTstate(&buildstate.giststate);
+
+	/*
+	 * Return statistics
+	 */
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+	result->heap_tuples = reltuples;
+	result->index_tuples = (double) buildstate.indtuples;
+
+	PG_RETURN_POINTER(result);
+}
+
+
+/*
+ * Validator for "buffering" reloption on GiST indexes. Allows "on", "off"
+ * and "auto" values.
+ */
+void
+gistValidateBufferingOption(char *value)
+{
+	if (value == NULL ||
+		(strcmp(value, "on") != 0 &&
+		 strcmp(value, "off") != 0 &&
+		 strcmp(value, "auto") != 0))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for \"buffering\" option"),
+				 errdetail("Valid values are \"on\", \"off\" and \"auto\".")));
+	}
+}
+
+/*
+ * Free unreferenced parts of a path stack.
+ */
+static void
+gistFreeUnreferencedPath(GISTBufferingInsertStack *path)
+{
+	while (path->refCount == 0)
+	{
+		/*
+		 * This path item is unreferenced. We can free it and decrease the
+		 * reference count of its parent. If the parent becomes unreferenced
+		 * too, the procedure is repeated for it.
+		 */
+		GISTBufferingInsertStack *tmp = path->parent;
+
+		pfree(path);
+		path = tmp;
+		if (path)
+			path->refCount--;
+		else
+			break;
+	}
+}
+
+/*
+ * Decrease reference count of path part, and free any unreferenced parts of
+ * the path stack.
+ */
+void
+gistDecreasePathRefcount(GISTBufferingInsertStack *path)
+{
+	path->refCount--;
+	gistFreeUnreferencedPath(path);
+}
+
+/*
+ * Process an index tuple. Runs the tuple down the tree until we reach a leaf
+ * page or node buffer, and inserts the tuple there. Returns true if we have
+ * to stop the buffer emptying process (because one of the child buffers can't
+ * take any more index tuples).
+ */
+static bool
+gistProcessItup(GISTSTATE *giststate, GISTInsertState *state,
+				GISTBuildBuffers *gfbb, IndexTuple itup,
+				GISTBufferingInsertStack *startparent)
+{
+	GISTBufferingInsertStack *path;
+	BlockNumber childblkno;
+	Buffer		buffer;
+	bool		result = false;
+
+	/*
+	 * NULL passed in startparent means that we start index tuple processing
+	 * from the root.
+	 */
+	if (!startparent)
+		path = gfbb->rootitem;
+	else
+		path = startparent;
+
+	/*
+	 * Loop until we reach a leaf page (level == 0) or a level with buffers
+	 * (not including the level we start at, because we would otherwise make
+	 * no progress).
+	 */
+	for (;;)
+	{
+		ItemId		iid;
+		IndexTuple	idxtuple,
+					newtup;
+		Page		page;
+		OffsetNumber childoffnum;
+		GISTBufferingInsertStack *parent;
+
+		/* Have we reached a level with buffers? */
+		if (LEVEL_HAS_BUFFERS(path->level, gfbb) && path != startparent)
+			break;
+
+		/* Have we reached a leaf page? */
+		if (path->level == 0)
+			break;
+
+		/*
+		 * Nope. Descend down to the next level then. Choose a child to descend
+		 * down to.
+		 */
+		buffer = ReadBuffer(state->r, path->blkno);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+
+		page = (Page) BufferGetPage(buffer);
+		childoffnum = gistchoose(state->r, page, itup, giststate);
+		iid = PageGetItemId(page, childoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+		/*
+		 * Check that the key representing the target child node is
+		 * consistent with the key we're inserting. Update it if it's not.
+		 */
+		newtup = gistgetadjusted(state->r, idxtuple, itup, giststate);
+		if (newtup)
+			gistbufferinginserttuples(state, giststate, buffer, &newtup, 1,
+									  childoffnum, path);
+		UnlockReleaseBuffer(buffer);
+
+		/* Create new path item representing current page */
+		parent = path;
+		path = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+										   sizeof(GISTBufferingInsertStack));
+		path->parent = parent;
+		path->level = parent->level - 1;
+		path->blkno = childblkno;
+		path->downlinkoffnum = childoffnum;
+		path->refCount = 0;		/* it's unreferenced for now */
+
+		/* Adjust reference count of parent */
+		if (parent)
+			parent->refCount++;
+	}
+
+	if (LEVEL_HAS_BUFFERS(path->level, gfbb))
+	{
+		/*
+		 * We've reached a level with buffers. Place the index tuple in the
+		 * buffer, and add the buffer to the emptying queue if it overflows.
+		 */
+		GISTNodeBuffer *childNodeBuffer;
+
+		/* Find the buffer or create a new one */
+		childNodeBuffer = gistGetNodeBuffer(gfbb, giststate, path->blkno,
+											path->downlinkoffnum, path->parent);
+
+		/* Add index tuple to it */
+		gistPushItupToNodeBuffer(gfbb, childNodeBuffer, itup);
+
+		if (BUFFER_OVERFLOWED(childNodeBuffer, gfbb))
+			result = true;
+	}
+	else
+	{
+		/*
+		 * We've reached a leaf page. Place the tuple here.
+		 */
+		buffer = ReadBuffer(state->r, path->blkno);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistbufferinginserttuples(state, giststate, buffer, &itup, 1,
+								  InvalidOffsetNumber, path);
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/*
+	 * Free unreferenced path items, if any. Path item may be referenced by
+	 * node buffer.
+	 */
+	gistFreeUnreferencedPath(path);
+
+	return result;
+}
+
+/*
+ * Insert tuples to a given page.
+ *
+ * This is analogous with gistinserttuples() in the regular insertion code.
+ */
+static void
+gistbufferinginserttuples(GISTInsertState *state, GISTSTATE *giststate,
+						  Buffer buffer,
+						  IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+						  GISTBufferingInsertStack *path)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	List	   *splitinfo;
+	bool		is_split;
+
+	is_split = gistplacetopage(state, giststate, buffer,
+							   itup, ntup, oldoffnum,
+							   InvalidBuffer,
+							   &splitinfo);
+	/*
+	 * If this is a root split, update the root path item kept in memory.
+	 * This ensures that all path stacks are always complete, including all
+	 * parent nodes up to the root. That simplifies the algorithm to re-find
+	 * correct parent.
+	 */
+	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
+	{
+		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
+		Page		page = BufferGetPage(buffer);
+		ItemId		iid;
+		IndexTuple	idxtuple;
+		BlockNumber leftmostchild;
+
+		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+			gfbb->context, sizeof(GISTBufferingInsertStack));
+		gfbb->rootitem->parent = NULL;
+		gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+		gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+		gfbb->rootitem->level = oldroot->level + 1;
+		gfbb->rootitem->refCount = 1;
+
+		/*
+		 * All the downlinks on the old root page are now on one of the child
+		 * pages. Change the block number of the old root entry in the stack
+		 * to point to the leftmost child. The other child pages will be
+		 * accessible from there by walking right.
+		 */
+		iid = PageGetItemId(page, FirstOffsetNumber);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		leftmostchild = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+		oldroot->parent = gfbb->rootitem;
+		oldroot->blkno = leftmostchild;
+		oldroot->downlinkoffnum = InvalidOffsetNumber;
+	}
+
+	if (splitinfo)
+	{
+		/*
+		 * Insert the downlinks to the parent. This is analogous with
+		 * gistfinishsplit() in the regular insertion code, but the locking
+		 * is simpler, and we have to maintain the buffers.
+		 */
+		IndexTuple *downlinks;
+		int			ndownlinks,
+					i;
+		Buffer		parentBuffer;
+		ListCell   *lc;
+
+		/* Parent may have changed since we memorized this path. */
+		gistBufferingFindCorrectParent(giststate, state->r, path);
+
+		/*
+		 * If there's a buffer associated with this page, that needs to
+		 * be split too. gistRelocateBuildBuffersOnSplit() will also adjust
+		 * the downlinks in 'splitinfo', to make sure they're consistent not
+		 * only with the tuples already on the pages, but also the tuples in
+		 * the buffers that will eventually be inserted to them.
+		 */
+		gistRelocateBuildBuffersOnSplit(gfbb, giststate, state->r,
+										path, buffer, splitinfo);
+
+		/* Create an array of all the downlink tuples */
+		ndownlinks = list_length(splitinfo);
+		downlinks = (IndexTuple *) palloc(sizeof(IndexTuple) * ndownlinks);
+		i = 0;
+		foreach(lc, splitinfo)
+		{
+			GISTPageSplitInfo *splitinfo = lfirst(lc);
+
+			/*
+			 * Since there's no concurrent access, we can release the lower
+			 * level buffers immediately. Don't release the buffer for the
+			 * original page, though, because the caller will release that.
+			 */
+			if (splitinfo->buf != buffer)
+				UnlockReleaseBuffer(splitinfo->buf);
+			downlinks[i++] = splitinfo->downlink;
+		}
+
+		/* Insert them into parent. */
+		parentBuffer = ReadBuffer(state->r, path->parent->blkno);
+		LockBuffer(parentBuffer, GIST_EXCLUSIVE);
+		gistbufferinginserttuples(state, giststate, parentBuffer,
+								  downlinks, ndownlinks,
+								  path->downlinkoffnum, path->parent);
+		UnlockReleaseBuffer(parentBuffer);
+
+		list_free_deep(splitinfo);		/* we don't need this anymore */
+	}
+}
+
+/*
+ * Find the correct parent by following rightlinks in a buffering index build.
+ * This method of searching for the parent works because no concurrent
+ * activity is possible during an index build.
+ */
+static void
+gistBufferingFindCorrectParent(GISTSTATE *giststate, Relation r,
+							   GISTBufferingInsertStack *child)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+	GISTBufferingInsertStack *parent = child->parent;
+	OffsetNumber i,
+				maxoff;
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	Buffer		buffer;
+	Page		page;
+	bool		copied = false;
+
+	buffer = ReadBuffer(r, parent->blkno);
+	page = BufferGetPage(buffer);
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	gistcheckpage(r, buffer);
+
+	/* Check if it was not moved */
+	if (child->downlinkoffnum != InvalidOffsetNumber)
+	{
+		iid = PageGetItemId(page, child->downlinkoffnum);
+		idxtuple = (IndexTuple) PageGetItem(page, iid);
+		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+		{
+			/* Still there */
+			UnlockReleaseBuffer(buffer);
+			return;
+		}
+	}
+
+	/* Parent has changed; follow rightlinks until the downlink is found */
+	while (true)
+	{
+		/* Search for relevant downlink in the current page */
+		maxoff = PageGetMaxOffsetNumber(page);
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+			{
+				/* found it */
+				child->downlinkoffnum = i;
+				UnlockReleaseBuffer(buffer);
+				return;
+			}
+		}
+
+		/*
+		 * We should copy parent path item because some other path items can
+		 * refer to it.
+		 */
+		if (!copied)
+		{
+			parent = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
+										   sizeof(GISTBufferingInsertStack));
+			memcpy(parent, child->parent, sizeof(GISTBufferingInsertStack));
+			if (parent->parent)
+				parent->parent->refCount++;
+			gistDecreasePathRefcount(child->parent);
+			child->parent = parent;
+			parent->refCount = 1;
+			copied = true;
+		}
+
+		/*
+		 * Not found in current page. Move towards rightlink.
+		 */
+		parent->blkno = GistPageGetOpaque(page)->rightlink;
+		UnlockReleaseBuffer(buffer);
+
+		if (parent->blkno == InvalidBlockNumber)
+		{
+			/*
+			 * End of chain and still didn't find parent. Should not happen
+			 * during index build.
+			 */
+			break;
+		}
+
+		/* Get the next page */
+		buffer = ReadBuffer(r, parent->blkno);
+		page = BufferGetPage(buffer);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(r, buffer);
+	}
+
+	elog(ERROR, "failed to re-find parent for block %u", child->blkno);
+}
+
+/*
+ * Process the buffer emptying stack. Emptying one buffer can cause emptying
+ * of other buffers. This function iterates until this cascading emptying
+ * process has finished, i.e. until the buffer emptying stack is empty.
+ */
+static void
+gistProcessEmptyingStack(GISTSTATE *giststate, GISTInsertState *state)
+{
+	GISTBuildBuffers *gfbb = giststate->gfbb;
+
+	/* Iterate while we have elements in buffers emptying stack. */
+	while (gfbb->bufferEmptyingQueue != NIL)
+	{
+		GISTNodeBuffer *emptyingNodeBuffer;
+
+		/* Get node buffer from emptying stack. */
+		emptyingNodeBuffer = (GISTNodeBuffer *) linitial(gfbb->bufferEmptyingQueue);
+		gfbb->bufferEmptyingQueue = list_delete_first(gfbb->bufferEmptyingQueue);
+		emptyingNodeBuffer->queuedForEmptying = false;
+
+		/*
+		 * We are going to load the last pages of the buffers that tuples will
+		 * be emptied into. So let's unload any previously loaded buffers.
+		 */
+		gistUnloadNodeBuffers(gfbb);
+
+		/* Variables for detecting a split of the buffer being emptied. */
+		gfbb->currentEmptyingBufferSplit = false;
+		gfbb->currentEmptyingBufferBlockNumber = emptyingNodeBuffer->nodeBlocknum;
+
+		while (true)
+		{
+			IndexTuple	itup;
+
+			/* Get next index tuple from the buffer */
+			if (!gistPopItupFromNodeBuffer(gfbb, emptyingNodeBuffer, &itup))
+				break;
+
+			/* Run it down to the underlying node buffer or leaf page */
+			if (gistProcessItup(giststate, state, gfbb, itup, emptyingNodeBuffer->path))
+				break;
+
+			/* Free all the memory allocated during index tuple processing */
+			MemoryContextReset(CurrentMemoryContext);
+
+			/*
+			 * If current emptying node buffer split, we have to stop emptying
+			 * it, because the buffer might not exist anymore.
+			 */
+			if (gfbb->currentEmptyingBufferSplit)
+				break;
+		}
+	}
+}
+
+/*
+ * Insert function for buffering index build.
+ */
+static void
+gistBufferingBuildInsert(Relation index, IndexTuple itup,
+						 GISTBuildState *buildstate)
+{
+	GISTBuildBuffers *gfbb = buildstate->giststate.gfbb;
+	GISTInsertState insertstate;
+
+	memset(&insertstate, 0, sizeof(GISTInsertState));
+	insertstate.freespace = buildstate->freespace;
+	insertstate.r = index;
+
+	/* We are ready for index tuple processing */
+	gistProcessItup(&buildstate->giststate, &insertstate, gfbb, itup, NULL);
+
+	/* Process buffer emptying stack if any */
+	gistProcessEmptyingStack(&buildstate->giststate, &insertstate);
+}
+
+/*
+ * Per-tuple callback from IndexBuildHeapScan.
+ */
+static void
+gistBuildCallback(Relation index,
+				  HeapTuple htup,
+				  Datum *values,
+				  bool *isnull,
+				  bool tupleIsAlive,
+				  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(&buildstate->giststate, index, values, isnull, true);
+	itup->t_tid = htup->t_self;
+
+	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE)
+	{
+		/* We have buffers, so use them. */
+		gistBufferingBuildInsert(index, itup, buildstate);
+	}
+	else
+	{
+		/*
+		 * There are no buffers (yet). Since we already have the index relation
+		 * locked, we call gistdoinsert directly.
+		 *
+		 * In this path we respect the fillfactor setting, whereas insertions
+		 * after initial build do not.
+		 */
+		gistdoinsert(index, itup, buildstate->freespace,
+					 &buildstate->giststate);
+	}
+
+	/* Increase statistics of index tuples count and their total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->tmpCtx);
+
+	if (buildstate->bufferingMode == GIST_BUFFERING_ACTIVE &&
+		buildstate->indtuples % BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET == 0)
+	{
+		/* Adjust the target buffer size now */
+		buildstate->giststate.gfbb->pagesPerBuffer =
+			calculatePagesPerBuffer(buildstate, index,
+									buildstate->giststate.gfbb->levelStep);
+	}
+
+	/*
+	 * In 'auto' mode, check if the index has grown too large to fit in
+	 * cache, and switch to buffering mode if it has.
+	 *
+	 * To avoid excessive calls to smgrnblocks(), only check this every
+	 * BUFFERING_MODE_SWITCH_CHECK_STEP index tuples.
+	 */
+	if ((buildstate->bufferingMode == GIST_BUFFERING_AUTO &&
+		 buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
+		 effective_cache_size < smgrnblocks(index->rd_smgr, MAIN_FORKNUM)) ||
+		(buildstate->bufferingMode == GIST_BUFFERING_STATS &&
+		 buildstate->indtuples >= BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET))
+	{
+		/*
+		 * Index doesn't fit in effective cache anymore, or enough tuple size
+		 * statistics have been gathered. Try to switch to buffering build mode.
+		 */
+		if (gistInitBuffering(buildstate, index))
+		{
+			/*
+			 * Buffering build is successfully initialized. Now we can set
+			 * appropriate flag.
+			 */
+			buildstate->bufferingMode = GIST_BUFFERING_ACTIVE;
+		}
+		else
+		{
+			/*
+			 * Failed to switch to buffering build because the memory settings
+			 * are too low. Mark that we won't try to switch again.
+			 */
+			buildstate->bufferingMode = GIST_BUFFERING_DISABLED;
+		}
+	}
+}
+
+/*
+ * Calculate pagesPerBuffer parameter for the buffering algorithm.
+ *
+ * Buffer size is chosen so that assuming that tuples are distributed
+ * randomly, emptying half a buffer fills on average one page in every buffer
+ * at the next lower level.
+ */
+static int
+calculatePagesPerBuffer(GISTBuildState *buildstate, Relation index,
+						int levelStep)
+{
+	double		pagesPerBuffer;
+	double		avgIndexTuplesPerPage;
+	double		itupAvgSize;
+	Size		pageFreeSpace;
+
+	/* Calculate the space on an index page available for index tuples */
+	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+		- sizeof(ItemIdData)
+		- buildstate->freespace;
+
+	/*
+	 * Calculate average size of already inserted index tuples using
+	 * gathered statistics.
+	 */
+	itupAvgSize = (double) buildstate->indtuplesSize /
+				  (double) buildstate->indtuples;
+
+	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+
+	/*
+	 * Recalculate required size of buffers.
+	 */
+	pagesPerBuffer = 2 * pow(avgIndexTuplesPerPage, levelStep);
+
+	return round(pagesPerBuffer);
+}
+
+
+/*
+ * Get the depth of the GiST index.
+ */
+static int
+gistGetMaxLevel(Relation index)
+{
+	int			maxLevel;
+	BlockNumber blkno;
+
+	/*
+	 * Traverse down the tree, starting from the root, until we hit the
+	 * leaf level.
+	 */
+	maxLevel = 0;
+	blkno = GIST_ROOT_BLKNO;
+	while (true)
+	{
+		Buffer		buffer;
+		Page		page;
+		IndexTuple	itup;
+
+		buffer = ReadBuffer(index, blkno);
+		page = (Page) BufferGetPage(buffer);
+
+		if (GistPageIsLeaf(page))
+		{
+			/* We hit the bottom, so we're done. */
+			ReleaseBuffer(buffer);
+			break;
+		}
+
+		/*
+		 * Pick the first downlink on the page, and follow it. It doesn't
+		 * matter which downlink we choose, the tree has the same depth
+		 * everywhere, so we just pick the first one.
+		 */
+		itup = (IndexTuple) PageGetItem(page,
+									 PageGetItemId(page, FirstOffsetNumber));
+		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+		ReleaseBuffer(buffer);
+
+		/*
+		 * We're going down the tree. It means that there is yet one more
+		 * level in the tree.
+		 */
+		maxLevel++;
+	}
+	return maxLevel;
+}
+
+/*
+ * Initial calculations for GiST buffering build.
+ */
+static bool
+gistInitBuffering(GISTBuildState *buildstate, Relation index)
+{
+	int			pagesPerBuffer;
+	Size		pageFreeSpace;
+	Size		itupAvgSize,
+				itupMinSize;
+	double		avgIndexTuplesPerPage,
+				maxIndexTuplesPerPage;
+	int			i;
+	int			levelStep;
+	GISTBuildBuffers *gfbb;
+
+	/* Calculate the space on an index page available for index tuples */
+	pageFreeSpace = BLCKSZ - SizeOfPageHeaderData - sizeof(GISTPageOpaqueData)
+		- sizeof(ItemIdData)
+		- buildstate->freespace;
+
+	/*
+	 * Calculate average size of already inserted index tuples using gathered
+	 * statistics.
+	 */
+	itupAvgSize = (double) buildstate->indtuplesSize /
+				  (double) buildstate->indtuples;
+
+	/*
+	 * Calculate the minimal possible size of an index tuple from the index
+	 * metadata. The minimal possible size of a varlena is VARHDRSZ.
+	 *
+	 * XXX: that's not actually true, as a short varlen can be just 2 bytes.
+	 * And we should take padding into account here.
+	 */
+	itupMinSize = (Size) MAXALIGN(sizeof(IndexTupleData));
+	for (i = 0; i < index->rd_att->natts; i++)
+	{
+		if (index->rd_att->attrs[i]->attlen < 0)
+			itupMinSize += VARHDRSZ;
+		else
+			itupMinSize += index->rd_att->attrs[i]->attlen;
+	}
+
+	/* Calculate average and maximal number of index tuples that fit on a page */
+	avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
+	maxIndexTuplesPerPage = pageFreeSpace / itupMinSize;
+
+	/*
+	 * We need to calculate two parameters for the buffering algorithm:
+	 * levelStep and pagesPerBuffer.
+	 *
+	 * levelStep determines the size of subtree that we operate on, while
+	 * emptying a buffer. A higher value is better, as you need fewer buffer
+	 * emptying steps to perform the index build. However, if you set it too
+	 * high, the subtree doesn't fit in cache anymore, and you quickly lose
+	 * the benefit of the buffers.
+	 *
+	 * In Arge et al's paper, levelStep is chosen as logB(M/4B), where B is
+	 * the number of tuples on page (ie. fanout), and M is the amount of
+	 * internal memory available. Curiously, they don't explain *why* that
+	 * setting is optimal. We calculate it by taking the highest levelStep
+	 * so that a subtree still fits in cache. For a small B, our way of
+	 * calculating levelStep is very close to Arge et al's formula. For a
+	 * large B, our formula gives a value that is 2x higher.
+	 *
+	 * The average size of a subtree of depth n can be calculated as a
+	 * geometric series:
+	 *
+	 *		B^0 + B^1 + B^2 + ... + B^n = (1 - B^(n + 1)) / (1 - B)
+	 *
+	 * where B is the average number of index tuples on page. The subtree is
+	 * cached in the shared buffer cache and the OS cache, so we choose
+	 * levelStep so that the subtree size is comfortably smaller than
+	 * effective_cache_size, with a safety factor of 4.
+	 *
+	 * The estimate on the average number of index tuples on page is based on
+	 * average tuple sizes observed before switching to buffered build, so the
+	 * real subtree size can be somewhat larger. Also, it would be selfish to
+	 * gobble the whole cache for our index build. The safety factor of 4
+	 * should account for those effects.
+	 *
+	 * The other limiting factor for setting levelStep is that while
+	 * processing a subtree, we need to hold one page for each buffer at the
+	 * next lower buffered level. The max. number of buffers needed for that
+	 * is maxIndexTuplesPerPage^levelStep. This is very conservative, but
+	 * hopefully maintenance_work_mem is set high enough that you're
+	 * constrained by effective_cache_size rather than maintenance_work_mem.
+	 *
+	 * XXX: the buffer hash table consumes a fair amount of memory too per
+	 * buffer, but that is not currently taken into account. That scales on
+	 * the total number of buffers used, ie. the index size and on levelStep.
+	 * Note that a higher levelStep *reduces* the amount of memory needed for
+	 * the hash table.
+	 */
+	levelStep = 1;
+	while (
+		/* subtree must fit in cache (with safety factor of 4) */
+		(1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size / 4
+		&&
+		/* each node in the lowest level of a subtree has one page in memory */
+		(pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ)
+		)
+	{
+		levelStep++;
+	}
+
+	/*
+	 * The loop above stopped at the first unacceptable value of levelStep.
+	 * So, decrease levelStep to get the last acceptable value.
+	 */
+	levelStep--;
+
+	/*
+	 * If there's not enough cache or maintenance_work_mem, fall back to plain
+	 * inserts.
+	 */
+	if (levelStep <= 0)
+	{
+		elog(DEBUG1, "failed to switch to buffered GiST build");
+		return false;
+	}
+
+	/*
+	 * The second parameter to set is pagesPerBuffer, which determines the
+	 * size of each buffer. We adjust pagesPerBuffer also during the build,
+	 * which is why this calculation is in a separate function.
+	 */
+	pagesPerBuffer = calculatePagesPerBuffer(buildstate, index, levelStep);
+
+	elog(DEBUG1, "switching to buffered GiST build; level step = %d, pagesPerBuffer = %d",
+		 levelStep, pagesPerBuffer);
+
+	/* Initialize GISTBuildBuffers with these parameters */
+	gfbb = palloc(sizeof(GISTBuildBuffers));
+	gfbb->pagesPerBuffer = pagesPerBuffer;
+	gfbb->levelStep = levelStep;
+	gistInitBuildBuffers(gfbb, gistGetMaxLevel(index));
+
+	buildstate->giststate.gfbb = gfbb;
+
+	return true;
+}
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
new file mode 100644
index 0000000..9580c05
--- /dev/null
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -0,0 +1,764 @@
+/*-------------------------------------------------------------------------
+ *
+ * gistbuildbuffers.c
+ *	  node buffer management functions for GiST buffering build algorithm.
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/gist/gistbuildbuffers.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/gist_private.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "miscadmin.h"
+#include "storage/buffile.h"
+#include "storage/bufmgr.h"
+#include "storage/indexfsm.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+static GISTNodeBufferPage *gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb);
+static void gistAddLoadedBuffer(GISTBuildBuffers *gfbb, BlockNumber blocknum);
+static void gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+static void gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer);
+static void gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple item);
+static void gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *item);
+static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
+static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+
+/*
+ * Initialize GiST buffering build data structure.
+ */
+void
+gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel)
+{
+	HASHCTL		hashCtl;
+
+	/*
+	 * Create a temporary file to hold buffer pages that are swapped out
+	 * of memory. Initialize data structures for free pages management.
+	 */
+	gfbb->pfile = BufFileCreateTemp(true);
+	gfbb->nFileBlocks = 0;
+	gfbb->nFreeBlocks = 0;
+	gfbb->freeBlocksLen = 32;
+	gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
+
+	/*
+	 * Current memory context will be used for all in-memory data structures
+	 * of buffers which are persistent during buffering build.
+	 */
+	gfbb->context = CurrentMemoryContext;
+
+	/*
+	 * The nodeBuffersTab hash table maps index blocks to their node
+	 * buffers.
+	 */
+	hashCtl.keysize = sizeof(BlockNumber);
+	hashCtl.entrysize = sizeof(GISTNodeBuffer);
+	hashCtl.hcxt = CurrentMemoryContext;
+	hashCtl.hash = tag_hash;
+	hashCtl.match = memcmp;
+	gfbb->nodeBuffersTab = hash_create("gistbuildbuffers",
+									   1024,
+									   &hashCtl,
+									   HASH_ELEM | HASH_CONTEXT
+									   | HASH_FUNCTION | HASH_COMPARE);
+
+	gfbb->bufferEmptyingQueue = NIL;
+
+	gfbb->currentEmptyingBufferBlockNumber = InvalidBlockNumber;
+	gfbb->currentEmptyingBufferSplit = false;
+
+	/*
+	 * Per-level lists of node buffers for the final buffer emptying process.
+	 * Node buffers are inserted here when they are created.
+	 */
+	gfbb->buffersOnLevelsLen = 1;
+	gfbb->buffersOnLevels = (List **) palloc(sizeof(List *) *
+											 gfbb->buffersOnLevelsLen);
+	gfbb->buffersOnLevels[0] = NIL;
+
+	/*
+	 * Block numbers of node buffers whose last pages are currently loaded
+	 * into main memory.
+	 */
+	gfbb->loadedBuffersLen = 32;
+	gfbb->loadedBuffers = (BlockNumber *) palloc(gfbb->loadedBuffersLen *
+												 sizeof(BlockNumber));
+	gfbb->loadedBuffersCount = 0;
+
+	/*
+	 * Root path item of the tree. Updated on each root node split.
+	 */
+	gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
+							gfbb->context, sizeof(GISTBufferingInsertStack));
+	gfbb->rootitem->parent = NULL;
+	gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
+	gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
+	gfbb->rootitem->level = maxLevel;
+	gfbb->rootitem->refCount = 1;
+}
+
+/*
+ * Returns a node buffer for given block. The buffer is created if it
+ * doesn't exist yet.
+ */
+GISTNodeBuffer *
+gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+				  BlockNumber nodeBlocknum,
+				  OffsetNumber downlinkoffnum,
+				  GISTBufferingInsertStack *parent)
+{
+	GISTNodeBuffer *nodeBuffer;
+	bool		found;
+
+	/* Find node buffer in hash table */
+	nodeBuffer = (GISTNodeBuffer *) hash_search(gfbb->nodeBuffersTab,
+												(const void *) &nodeBlocknum,
+												HASH_ENTER,
+												&found);
+	if (!found)
+	{
+		/*
+		 * Node buffer wasn't found. Initialize the new buffer as empty.
+		 */
+		GISTBufferingInsertStack *path;
+		int			level;
+		MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+		nodeBuffer->pageBuffer = NULL;
+		nodeBuffer->blocksCount = 0;
+		nodeBuffer->queuedForEmptying = false;
+
+		/*
+		 * Create a path stack for the page.
+		 */
+		if (nodeBlocknum != GIST_ROOT_BLKNO)
+		{
+			path = (GISTBufferingInsertStack *) palloc(
+										   sizeof(GISTBufferingInsertStack));
+			path->parent = parent;
+			path->blkno = nodeBlocknum;
+			path->downlinkoffnum = downlinkoffnum;
+			path->level = parent->level - 1;
+			path->refCount = 0;		/* initially unreferenced */
+			parent->refCount++;		/* this path references its parent */
+			Assert(path->level > 0);
+		}
+		else
+			path = gfbb->rootitem;
+
+		nodeBuffer->path = path;
+		path->refCount++;
+
+		/*
+		 * Add this buffer to the list of buffers on this level. Enlarge
+		 * buffersOnLevels array if needed.
+		 */
+		level = path->level;
+		if (level >= gfbb->buffersOnLevelsLen)
+		{
+			int			i;
+
+			gfbb->buffersOnLevels =
+				(List **) repalloc(gfbb->buffersOnLevels,
+								   (level + 1) * sizeof(List *));
+
+			/* initialize the enlarged portion */
+			for (i = gfbb->buffersOnLevelsLen; i <= level; i++)
+				gfbb->buffersOnLevels[i] = NIL;
+			gfbb->buffersOnLevelsLen = level + 1;
+		}
+
+		gfbb->buffersOnLevels[level] = lcons(nodeBuffer,
+											 gfbb->buffersOnLevels[level]);
+
+		MemoryContextSwitchTo(oldcxt);
+	}
+	else
+	{
+		if (parent != nodeBuffer->path->parent)
+		{
+			/*
+			 * The caller provided a different parent path item than the one we
+			 * remembered. We trust the caller's parent to be more up to date;
+			 * the previous parent may have been outdated by a page split.
+			 */
+			gistDecreasePathRefcount(nodeBuffer->path->parent);
+			nodeBuffer->path->parent = parent;
+			parent->refCount++;
+		}
+	}
+
+	return nodeBuffer;
+}
+
+/*
+ * Allocate memory for a buffer page.
+ */
+static GISTNodeBufferPage *
+gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
+{
+	GISTNodeBufferPage *pageBuffer;
+
+	pageBuffer = (GISTNodeBufferPage *) MemoryContextAlloc(gfbb->context,
+														   BLCKSZ);
+	pageBuffer->prev = InvalidBlockNumber;
+
+	/* Set page free space */
+	PAGE_FREE_SPACE(pageBuffer) = BLCKSZ - BUFFER_PAGE_DATA_OFFSET;
+	return pageBuffer;
+}
+
+/*
+ * Add the specified block number to the loadedBuffers array.
+ */
+static void
+gistAddLoadedBuffer(GISTBuildBuffers *gfbb, BlockNumber blocknum)
+{
+	/* Enlarge the array if needed */
+	if (gfbb->loadedBuffersCount >= gfbb->loadedBuffersLen)
+	{
+		gfbb->loadedBuffersLen *= 2;
+		gfbb->loadedBuffers = (BlockNumber *) repalloc(gfbb->loadedBuffers,
+													 gfbb->loadedBuffersLen *
+													   sizeof(BlockNumber));
+	}
+
+	gfbb->loadedBuffers[gfbb->loadedBuffersCount] = blocknum;
+	gfbb->loadedBuffersCount++;
+}
+
+
+/*
+ * Load last page of node buffer into main memory.
+ */
+static void
+gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+{
+	/* Check if we really should load something */
+	if (!nodeBuffer->pageBuffer && nodeBuffer->blocksCount > 0)
+	{
+		/* Allocate memory for page */
+		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+
+		/* Read block from temporary file */
+		BufFileSeekBlock(gfbb->pfile, nodeBuffer->pageBlocknum);
+		BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Mark file block as free */
+		gistBuffersReleaseBlock(gfbb, nodeBuffer->pageBlocknum);
+
+		/* Mark node buffer as loaded */
+		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+		nodeBuffer->pageBlocknum = InvalidBlockNumber;
+	}
+}
+
+/*
+ * Write last page of node buffer to the disk.
+ */
+static void
+gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
+{
+	/* Check if we have something to write */
+	if (nodeBuffer->pageBuffer)
+	{
+		BlockNumber blkno;
+
+		/* Get free file block */
+		blkno = gistBuffersGetFreeBlock(gfbb);
+
+		/* Write block to the temporary file */
+		BufFileSeekBlock(gfbb->pfile, blkno);
+		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Free memory of that page */
+		pfree(nodeBuffer->pageBuffer);
+		nodeBuffer->pageBuffer = NULL;
+
+		/* Save block number */
+		nodeBuffer->pageBlocknum = blkno;
+	}
+}
+
+/*
+ * Write last pages of all node buffers to the disk.
+ */
+void
+gistUnloadNodeBuffers(GISTBuildBuffers *gfbb)
+{
+	int			i;
+
+	/* Iterate over node buffers whose last page is loaded into main memory */
+	for (i = 0; i < gfbb->loadedBuffersCount; i++)
+	{
+		GISTNodeBuffer *nodeBuffer;
+		bool		found;
+
+		/* Find node buffer by its block number */
+		nodeBuffer = hash_search(gfbb->nodeBuffersTab, &gfbb->loadedBuffers[i],
+								 HASH_FIND, &found);
+
+		/*
+		 * The node buffer might not be found, because it can disappear during
+		 * a page split. That's OK, just skip it.
+		 */
+		if (!found)
+			continue;
+
+		/* Unload last page to the disk */
+		gistUnloadNodeBuffer(gfbb, nodeBuffer);
+	}
+	/* Now no node buffer has its last page loaded */
+	gfbb->loadedBuffersCount = 0;
+}
+
+/*
+ * Add index tuple to buffer page.
+ */
+static void
+gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple itup)
+{
+	/*
+	 * Get pointer to the start of free space on the page
+	 */
+	char	   *ptr = (char *) pageBuffer + BUFFER_PAGE_DATA_OFFSET
+	+ PAGE_FREE_SPACE(pageBuffer) - MAXALIGN(IndexTupleSize(itup));
+
+	/*
+	 * There should be enough space
+	 */
+	Assert(PAGE_FREE_SPACE(pageBuffer) >= MAXALIGN(IndexTupleSize(itup)));
+
+	/*
+	 * Reduce free space value of page
+	 */
+	PAGE_FREE_SPACE(pageBuffer) -= MAXALIGN(IndexTupleSize(itup));
+
+	/*
+	 * Copy index tuple to free space
+	 */
+	memcpy(ptr, itup, IndexTupleSize(itup));
+}
+
+/*
+ * Get last item from buffer page and remove it from page.
+ */
+static void
+gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *itup)
+{
+	/*
+	 * Get pointer to last index tuple
+	 */
+	IndexTuple	ptr = (IndexTuple) ((char *) pageBuffer
+									+ BUFFER_PAGE_DATA_OFFSET
+									+ PAGE_FREE_SPACE(pageBuffer));
+
+	/*
+	 * Page shouldn't be empty
+	 */
+	Assert(!PAGE_IS_EMPTY(pageBuffer));
+
+	/*
+	 * Allocate memory for returned index tuple copy
+	 */
+	*itup = (IndexTuple) palloc(IndexTupleSize(ptr));
+
+	/*
+	 * Copy data
+	 */
+	memcpy(*itup, ptr, IndexTupleSize(ptr));
+
+	/*
+	 * Increase free space value of page
+	 */
+	PAGE_FREE_SPACE(pageBuffer) += MAXALIGN(IndexTupleSize(*itup));
+}
+
+/*
+ * Push an index tuple to node buffer.
+ */
+void
+gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+						 IndexTuple itup)
+{
+	/*
+	 * Most memory operations here happen in the buffering build's persistent
+	 * context. So, let's switch to it.
+	 */
+	MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
+
+	/* Is the buffer currently empty? */
+	if (nodeBuffer->blocksCount == 0)
+	{
+		/* It's empty, let's create the first page */
+		nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
+		nodeBuffer->blocksCount = 1;
+		gistAddLoadedBuffer(gfbb, nodeBuffer->nodeBlocknum);
+	}
+
+	/* Load last page of node buffer if it wasn't already */
+	if (!nodeBuffer->pageBuffer)
+		gistLoadNodeBuffer(gfbb, nodeBuffer);
+
+	/*
+	 * Check if there is enough space on the last page for the tuple
+	 */
+	if (PAGE_NO_SPACE(nodeBuffer->pageBuffer, itup))
+	{
+		/*
+		 * Nope. Swap previous block to disk and allocate a new one.
+		 */
+		BlockNumber blkno;
+
+		/* Write filled page to the disk */
+		blkno = gistBuffersGetFreeBlock(gfbb);
+		BufFileSeekBlock(gfbb->pfile, blkno);
+		BufFileWrite(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+		/* Mark space of in-memory page as empty */
+		PAGE_FREE_SPACE(nodeBuffer->pageBuffer) =
+			BLCKSZ - MAXALIGN(offsetof(GISTNodeBufferPage, tupledata));
+
+		/* Save block number of the previous page */
+		nodeBuffer->pageBuffer->prev = blkno;
+
+		/* We've just added one more page */
+		nodeBuffer->blocksCount++;
+	}
+
+	gistPlaceItupToPage(nodeBuffer->pageBuffer, itup);
+
+	/*
+	 * If the buffer has become half-filled, add it to the emptying queue.
+	 */
+	if (BUFFER_HALF_FILLED(nodeBuffer, gfbb) && !nodeBuffer->queuedForEmptying)
+	{
+		MemoryContextSwitchTo(gfbb->context);
+		gfbb->bufferEmptyingQueue =	lcons(nodeBuffer, gfbb->bufferEmptyingQueue);
+		nodeBuffer->queuedForEmptying = true;
+	}
+
+	/* Restore memory context */
+	MemoryContextSwitchTo(oldcxt);
+}
+
+/*
+ * Removes one index tuple from node buffer. Returns true on success and false
+ * if node buffer is empty.
+ */
+bool
+gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
+						  IndexTuple *itup)
+{
+	/*
+	 * If node buffer is empty then return false.
+	 */
+	if (nodeBuffer->blocksCount <= 0)
+		return false;
+
+	/* Load last page of node buffer if needed */
+	if (!nodeBuffer->pageBuffer)
+		gistLoadNodeBuffer(gfbb, nodeBuffer);
+
+	/*
+	 * Get index tuple from last non-empty page.
+	 */
+	gistGetItupFromPage(nodeBuffer->pageBuffer, itup);
+
+	/*
+	 * Check if the page that the index tuple was taken from is now empty.
+	 */
+	if (PAGE_IS_EMPTY(nodeBuffer->pageBuffer))
+	{
+		BlockNumber prevblkno;
+
+		/*
+		 * If it's empty then we need to release buffer file block and free
+		 * page buffer.
+		 */
+		nodeBuffer->blocksCount--;
+
+		/*
+		 * If there are more pages, fetch the previous one.
+		 */
+		prevblkno = nodeBuffer->pageBuffer->prev;
+		if (prevblkno != InvalidBlockNumber)
+		{
+			/* There actually is previous page, so read it. */
+			Assert(nodeBuffer->blocksCount > 0);
+			BufFileSeekBlock(gfbb->pfile, prevblkno);
+			BufFileRead(gfbb->pfile, nodeBuffer->pageBuffer, BLCKSZ);
+
+			/* Mark block as free */
+			gistBuffersReleaseBlock(gfbb, prevblkno);
+		}
+		else
+		{
+			/* Actually there are no more pages. Free memory. */
+			Assert(nodeBuffer->blocksCount == 0);
+			pfree(nodeBuffer->pageBuffer);
+			nodeBuffer->pageBuffer = NULL;
+		}
+	}
+	return true;
+}
+
+/*
+ * Select a currently unused block for writing to.
+ *
+ * NB: should only be called when writer is ready to write immediately,
+ * to ensure that first write pass is sequential.
+ */
+static long
+gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb)
+{
+	/*
+	 * If there are multiple free blocks, we select the one appearing last in
+	 * freeBlocks[].  If there are none, assign the next block at the end of
+	 * the file.
+	 */
+	if (gfbb->nFreeBlocks > 0)
+		return gfbb->freeBlocks[--gfbb->nFreeBlocks];
+	else
+		return gfbb->nFileBlocks++;
+}
+
+/*
+ * Return a block# to the freelist.
+ */
+static void
+gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum)
+{
+	int			ndx;
+
+	/*
+	 * Enlarge freeBlocks array if full.
+	 */
+	if (gfbb->nFreeBlocks >= gfbb->freeBlocksLen)
+	{
+		gfbb->freeBlocksLen *= 2;
+		gfbb->freeBlocks = (long *) repalloc(gfbb->freeBlocks,
+											 gfbb->freeBlocksLen *
+											 sizeof(long));
+	}
+
+	/*
+	 * Add blocknum to the end of the array. The array is not kept sorted;
+	 * gistBuffersGetFreeBlock() simply takes the last entry.
+	 */
+	ndx = gfbb->nFreeBlocks++;
+	gfbb->freeBlocks[ndx] = blocknum;
+}
+
+/*
+ * Free buffering build data structure.
+ */
+void
+gistFreeBuildBuffers(GISTBuildBuffers *gfbb)
+{
+	/* Close buffers file. */
+	BufFileClose(gfbb->pfile);
+
+	/* All other things will be freed on memory context release */
+}
+
+/*
+ * Information about a node buffer that receives index tuples relocated from
+ * the node buffer of a split page.
+ */
+typedef struct
+{
+	GISTENTRY	entry[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	GISTPageSplitInfo *splitinfo;
+	GISTNodeBuffer *nodeBuffer;
+} RelocationBufferInfo;
+
+/*
+ * Maintain data structures on page split.
+ */
+void
+gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+								Relation r, GISTBufferingInsertStack *path,
+								Buffer buffer, List *splitinfo)
+{
+	RelocationBufferInfo *relocationBuffersInfos;
+	bool		found;
+	GISTNodeBuffer *nodeBuffer;
+	BlockNumber blocknum;
+	IndexTuple	itup;
+	int			splitPagesCount = 0,
+				i;
+	GISTENTRY	entry[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	GISTNodeBuffer nodebuf;
+	ListCell   *lc;
+
+	/*
+	 * If the level of the split page doesn't have buffers, we have nothing to do.
+	 */
+	if (!LEVEL_HAS_BUFFERS(path->level, gfbb))
+		return;
+
+	/*
+	 * Get a pointer to the node buffer of the split page.
+	 */
+	blocknum = BufferGetBlockNumber(buffer);
+	nodeBuffer = hash_search(gfbb->nodeBuffersTab, &blocknum,
+							 HASH_FIND, &found);
+	if (!found)
+	{
+		/*
+		 * Node buffer should exist at this point. If it didn't exist before,
+		 * the insertion that caused the page to split should've created it.
+		 */
+		elog(ERROR, "node buffer of page being split (%u) does not exist",
+			 blocknum);
+	}
+
+	/*
+	 * Make a copy of the old buffer, as we're going to reuse the old one as
+	 * the buffer for the new left page, which is on the same block as the
+	 * old page. That's not true for the root page, but that's fine because
+	 * we never have a buffer on the root page anyway. The original algorithm
+	 * as described by Arge et al did, but it's of no use, as you might as
+	 * well read the tuples straight from the heap instead of the root buffer.
+	 */
+	Assert(blocknum != GIST_ROOT_BLKNO);
+	memcpy(&nodebuf, nodeBuffer, sizeof(GISTNodeBuffer));
+
+	/* Reset the old buffer, used for the new left page from now on */
+	nodeBuffer->blocksCount = 0;
+	nodeBuffer->pageBuffer = NULL;
+	nodeBuffer->pageBlocknum = InvalidBlockNumber;
+
+	/* Reassign pointer to the saved copy. */
+	nodeBuffer = &nodebuf;
+
+	/*
+	 * Allocate memory for information about relocation buffers.
+	 */
+	splitPagesCount = list_length(splitinfo);
+	relocationBuffersInfos =
+		(RelocationBufferInfo *) palloc(sizeof(RelocationBufferInfo) *
+										splitPagesCount);
+
+	/*
+	 * Fill relocation buffers information for node buffers of pages produced
+	 * by split.
+	 */
+	i = 0;
+	foreach(lc, splitinfo)
+	{
+		GISTPageSplitInfo *si = (GISTPageSplitInfo *) lfirst(lc);
+		GISTNodeBuffer *newNodeBuffer;
+
+		/* Decompress parent index tuple of node buffer page. */
+		gistDeCompressAtt(giststate, r,
+						  si->downlink, NULL, (OffsetNumber) 0,
+						  relocationBuffersInfos[i].entry,
+						  relocationBuffersInfos[i].isnull);
+
+		newNodeBuffer = gistGetNodeBuffer(gfbb, giststate, BufferGetBlockNumber(si->buf),
+								   path->downlinkoffnum, path->parent);
+
+		relocationBuffersInfos[i].nodeBuffer = newNodeBuffer;
+		relocationBuffersInfos[i].splitinfo = si;
+
+		i++;
+	}
+
+	/*
+	 * Loop through all index tuples in the buffer of the split page,
+	 * moving all the tuples to the buffers on the new pages.
+	 */
+	while (gistPopItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
+	{
+		float		sum_grow,
+					which_grow[INDEX_MAX_KEYS];
+		int			i,
+					which;
+		IndexTuple	newtup;
+
+		/*
+		 * Choose which page this tuple should go to.
+		 */
+		gistDeCompressAtt(giststate, r,
+						  itup, NULL, (OffsetNumber) 0, entry, isnull);
+
+		which = -1;
+		*which_grow = -1.0f;
+		sum_grow = 1.0f;
+
+		for (i = 0; i < splitPagesCount && sum_grow; i++)
+		{
+			int			j;
+			RelocationBufferInfo *splitPageInfo = &relocationBuffersInfos[i];
+
+			sum_grow = 0.0f;
+			for (j = 0; j < r->rd_att->natts; j++)
+			{
+				float		usize;
+
+				usize = gistpenalty(giststate, j,
+									&splitPageInfo->entry[j],
+									splitPageInfo->isnull[j],
+									&entry[j], isnull[j]);
+
+				if (which_grow[j] < 0 || usize < which_grow[j])
+				{
+					which = i;
+					which_grow[j] = usize;
+					if (j < r->rd_att->natts - 1 && i == 0)
+						which_grow[j + 1] = -1;
+					sum_grow += which_grow[j];
+				}
+				else if (which_grow[j] == usize)
+					sum_grow += usize;
+				else
+				{
+					sum_grow = 1;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * push item to selected node buffer
+		 */
+		gistPushItupToNodeBuffer(gfbb, relocationBuffersInfos[which].nodeBuffer,
+								 itup);
+
+		/*
+		 * Adjust the downlink for this page, if needed.
+		 */
+		newtup = gistgetadjusted(r, relocationBuffersInfos[which].splitinfo->downlink,
+								 itup, giststate);
+		if (newtup)
+		{
+			gistDeCompressAtt(giststate, r,
+							  newtup, NULL, (OffsetNumber) 0,
+							  relocationBuffersInfos[which].entry,
+							  relocationBuffersInfos[which].isnull);
+
+			relocationBuffersInfos[which].splitinfo->downlink = newtup;
+		}
+	}
+
+	/* Report that the buffer currently being emptied has been split */
+	if (blocknum == gfbb->currentEmptyingBufferBlockNumber)
+		gfbb->currentEmptyingBufferSplit = true;
+
+	pfree(relocationBuffersInfos);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 1754a10..bae990b 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -670,13 +670,30 @@ gistoptions(PG_FUNCTION_ARGS)
 {
 	Datum		reloptions = PG_GETARG_DATUM(0);
 	bool		validate = PG_GETARG_BOOL(1);
-	bytea	   *result;
+	relopt_value *options;
+	GiSTOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
+		{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
+	};
 
-	result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		PG_RETURN_NULL();
+
+	rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+
+	PG_RETURN_BYTEA_P(rdopts);
 
-	if (result)
-		PG_RETURN_BYTEA_P(result);
-	PG_RETURN_NULL();
 }
 
 /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 02c4ec3..9cf4875 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -266,7 +266,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
 			else
 				GistPageGetOpaque(page)->rightlink = xldata->origrlink;
 			GistPageGetOpaque(page)->nsn = xldata->orignsn;
-			if (i < xlrec.data->npage - 1 && !isrootsplit)
+			if (i < xlrec.data->npage - 1 && !isrootsplit &&
+				!xldata->noFollowRight)
 				GistMarkFollowRight(page);
 			else
 				GistClearFollowRight(page);
@@ -414,7 +415,7 @@ XLogRecPtr
 gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 			  SplitedPageLayout *dist,
 			  BlockNumber origrlink, GistNSN orignsn,
-			  Buffer leftchildbuf)
+			  Buffer leftchildbuf, bool noFollowRight)
 {
 	XLogRecData *rdata;
 	gistxlogPageSplit xlrec;
@@ -436,6 +437,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
 	xlrec.npage = (uint16) npage;
 	xlrec.leftchild =
 		BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
+	xlrec.noFollowRight = noFollowRight;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = sizeof(gistxlogPageSplit);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 9fb20a6..3750c2d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -17,13 +17,56 @@
 #include "access/gist.h"
 #include "access/itup.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "utils/rbtree.h"
+#include "utils/hsearch.h"
+
+/* Does the specified level have buffers? */
+#define LEVEL_HAS_BUFFERS(nlevel, gfbb) ((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && (nlevel) != (gfbb)->rootitem->level)
+/* Is the specified buffer at least half-filled (should be queued for emptying)? */
+#define BUFFER_HALF_FILLED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
+/* Has the specified buffer overflowed (can't take any more index tuples)? */
+#define BUFFER_OVERFLOWED(nodeBuffer, gfbb) ((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
 
 /* Buffer lock modes */
 #define GIST_SHARE	BUFFER_LOCK_SHARE
 #define GIST_EXCLUSIVE	BUFFER_LOCK_EXCLUSIVE
 #define GIST_UNLOCK BUFFER_LOCK_UNLOCK
 
+typedef struct
+{
+	BlockNumber prev;
+	uint32		freespace;
+	char		tupledata[1];
+} GISTNodeBufferPage;
+
+#define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
+/* Returns free space in node buffer page */
+#define PAGE_FREE_SPACE(nbp) ((nbp)->freespace)
+/* Checks if node buffer page is empty */
+#define PAGE_IS_EMPTY(nbp) ((nbp)->freespace == BLCKSZ - BUFFER_PAGE_DATA_OFFSET)
+/* Checks if node buffer page lacks sufficient space for the index tuple */
+#define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
+										MAXALIGN(IndexTupleSize(itup)))
+
+/* Buffer of tree node data structure */
+typedef struct
+{
+	/* number of page containing node */
+	BlockNumber nodeBlocknum;
+
+	/* count of blocks occupied by buffer */
+	int32		blocksCount;
+
+	BlockNumber pageBlocknum;
+	GISTNodeBufferPage *pageBuffer;
+
+	/* is this buffer queued for emptying? */
+	bool		queuedForEmptying;
+
+	struct GISTBufferingInsertStack *path;
+} GISTNodeBuffer;
+
 /*
  * GISTSTATE: information needed for any GiST index operation
  *
@@ -44,6 +87,8 @@ typedef struct GISTSTATE
 	/* Collations to pass to the support functions */
 	Oid			supportCollation[INDEX_MAX_KEYS];
 
+	struct GISTBuildBuffers *gfbb;
+
 	TupleDesc	tupdesc;
 } GISTSTATE;
 
@@ -170,6 +215,7 @@ typedef struct gistxlogPageSplit
 
 	BlockNumber leftchild;		/* like in gistxlogPageUpdate */
 	uint16		npage;			/* # of pages in the split */
+	bool		noFollowRight;	/* skip followRight flag setting */
 
 	/*
 	 * follow: 1. gistxlogPage and array of IndexTupleData per page
@@ -225,6 +271,74 @@ typedef struct GISTInsertStack
 	struct GISTInsertStack *parent;
 } GISTInsertStack;
 
+/*
+ * Extended GISTInsertStack for buffering GiST index build. It additionally
+ * holds the level number of the page.
+ */
+typedef struct GISTBufferingInsertStack
+{
+	/* current page */
+	BlockNumber blkno;
+
+	/* offset of the downlink in the parent page, that points to this page */
+	OffsetNumber downlinkoffnum;
+
+	/* pointer to parent */
+	struct GISTBufferingInsertStack *parent;
+
+	int			refCount;
+
+	/* level number */
+	int			level;
+}	GISTBufferingInsertStack;
+
+/*
+ * Data structure with general information about build buffers.
+ */
+typedef struct GISTBuildBuffers
+{
+	/* memory context which is persistent during buffering build */
+	MemoryContext context;
+	/* underlying files */
+	BufFile    *pfile;
+	/* # of blocks used in underlying files */
+	long		nFileBlocks;
+	/* resizable array of free blocks */
+	long	   *freeBlocks;
+	/* # of currently free blocks */
+	int			nFreeBlocks;
+	/* current allocated length of freeBlocks[] */
+	int			freeBlocksLen;
+
+	/* hash for buffers by block number */
+	HTAB	   *nodeBuffersTab;
+
+	/* stack of buffers for emptying */
+	List	   *bufferEmptyingQueue;
+	/* number of currently emptying buffer */
+	BlockNumber currentEmptyingBufferBlockNumber;
+	/* whether currently emptying buffer was split - a signal to stop emptying */
+	bool		currentEmptyingBufferSplit;
+
+	/* step of levels for buffers location */
+	int			levelStep;
+	/* maximal number of pages occupied by buffer */
+	int			pagesPerBuffer;
+
+	/* array of lists of non-empty buffers on levels for final emptying */
+	List	  **buffersOnLevels;
+	int			buffersOnLevelsLen;
+
+	/*
+	 * Dynamically-sized array of block numbers of buffers loaded into main
+	 * memory.
+	 */
+	BlockNumber *loadedBuffers;
+	int			loadedBuffersCount;		/* entries currently in loadedBuffers */
+	int			loadedBuffersLen;		/* allocated size of loadedBuffers */
+	GISTBufferingInsertStack *rootitem;
+}	GISTBuildBuffers;
+
 typedef struct GistSplitVector
 {
 	GIST_SPLITVEC splitVector;	/* to/from PickSplit method */
@@ -286,6 +400,23 @@ extern Datum gistinsert(PG_FUNCTION_ARGS);
 extern MemoryContext createTempGistContext(void);
 extern void initGISTstate(GISTSTATE *giststate, Relation index);
 extern void freeGISTstate(GISTSTATE *giststate);
+extern void gistdoinsert(Relation r,
+			 IndexTuple itup,
+			 Size freespace,
+			 GISTSTATE *GISTstate);
+
+/* A List of these is returned from gistplacetopage() in *splitinfo */
+typedef struct
+{
+	Buffer		buf;			/* the split page "half" */
+	IndexTuple	downlink;		/* downlink for this half. */
+} GISTPageSplitInfo;
+
+extern bool gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
+				Buffer buffer,
+				IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
+				Buffer leftchildbuf,
+				List **splitinfo);
 
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
@@ -305,7 +436,7 @@ extern XLogRecPtr gistXLogSplit(RelFileNode node,
 			  BlockNumber blkno, bool page_is_leaf,
 			  SplitedPageLayout *dist,
 			  BlockNumber origrlink, GistNSN oldnsn,
-			  Buffer leftchild);
+			  Buffer leftchild, bool noFollowRight);
 
 /* gistget.c */
 extern Datum gistgettuple(PG_FUNCTION_ARGS);
@@ -313,6 +444,16 @@ extern Datum gistgetbitmap(PG_FUNCTION_ARGS);
 
 /* gistutil.c */
 
+/*
+ * Storage type for GiST's reloptions
+ */
+typedef struct GiSTOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;		/* page fill factor in percent (0..100) */
+	int			bufferingModeOffset;	/* use buffering build? */
+}	GiSTOptions;
+
 #define GiSTPageSize   \
 	( BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(GISTPageOpaqueData)) )
 
@@ -380,4 +521,24 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
 			   GistSplitVector *v, GistEntryVector *entryvec,
 			   int attno);
 
+/* gistbuild.c */
+extern void gistDecreasePathRefcount(GISTBufferingInsertStack *path);
+extern void gistValidateBufferingOption(char *value);
+
+/* gistbuildbuffers.c */
+extern void gistInitBuildBuffers(GISTBuildBuffers *gfbb, int maxLevel);
+extern GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
+				  BlockNumber blkno, OffsetNumber downlinkoffnum,
+				  GISTBufferingInsertStack *parent);
+extern void gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb,
+						 GISTNodeBuffer *nodeBuffer, IndexTuple item);
+extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
+						  GISTNodeBuffer *nodeBuffer, IndexTuple *item);
+extern void gistFreeBuildBuffers(GISTBuildBuffers *gfbb);
+extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
+								GISTSTATE *giststate, Relation r,
+							  GISTBufferingInsertStack *path, Buffer buffer,
+								List *splitinfo);
+extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
+
 #endif   /* GIST_PRIVATE_H */
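
The buffer placement rule behind LEVEL_HAS_BUFFERS and BUFFER_HALF_FILLED in the header above can be tried out in isolation. This is a minimal sketch, not code from the patch: the function names and the plain-int parameters are assumptions standing in for the gfbb fields (levelStep, pagesPerBuffer, rootitem->level), and level 0 is assumed to be the leaf level.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for LEVEL_HAS_BUFFERS: a level carries buffers if it
 * is not level 0, is a multiple of the level step, and is not the root level.
 */
static int
toy_level_has_buffers(int level, int level_step, int root_level)
{
    assert(level_step > 0);
    return level != 0 && level % level_step == 0 && level != root_level;
}

/*
 * Hypothetical stand-in for BUFFER_HALF_FILLED: true once the buffer occupies
 * more than half of the pages it is allowed to use.
 */
static int
toy_buffer_half_filled(int blocks_count, int pages_per_buffer)
{
    return blocks_count > pages_per_buffer / 2;
}
```

With a level step of 2 and the root at level 6, only levels 2 and 4 carry buffers; a buffer limited to 8 pages counts as half-filled from its fifth page on.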
#131Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#130)
Re: WIP: Fast GiST index build

On 05.09.2011 14:10, Heikki Linnakangas wrote:

On 01.09.2011 12:23, Alexander Korotkov wrote:

On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
 points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
 points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?

But that's still a lot of overlap: about 220 accesses per small-area
request, which is roughly 10 - 20 times more than it should be without
overlaps.
If we roughly assume that 10 times more overlap means only 1/10 of the
tree is used for actual inserts, then that part of the tree can easily
fit in the cache.
You can try my splitting algorithm on your test setup (in this case I
advise starting with a smaller number of rows, 100 M for example).
I'm requesting from Oleg the real-life datasets that cause trouble in
practice. Probably those datasets are even larger, or the new linear split
produces less overlap on them.

I made a small tweak to the patch, and got much better results (this is
with my original method of generating the data):

          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
 points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
 points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986

The full results of this test are in:

          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
 points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
 points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
 points ordered buffered     | 250000000 | 02:00:10.467914 |  5572906
 points ordered auto         | 250000000 | 02:16:01.859039 |  5435673
 points ordered unbuffered   | 250000000 | 03:23:18.061742 |  1875826
(6 rows)

Interestingly, in this test case the buffered build was significantly
faster even in the case of ordered input - but the quality of the
produced index was much worse. I suspect it's because of the
last-in-first-out nature of the buffering: tuples that are pushed into
buffers first are flushed to lower levels last. Tweaking the data
structures to make the buffer flushing a FIFO process might help with
that, but I'm afraid that might make our cache hit ratio worse when
reading pages from the temporary file.
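
The ordering effect can be seen in a toy model. The following is a minimal sketch, not code from the patch: ToyBuffer and the toy_* functions are hypothetical names, and plain integers stand in for index tuples. Popping from the tail gives the last-in-first-out order described above; popping from the head gives FIFO order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUFFER_CAPACITY 16

/* Hypothetical toy buffer; the real node buffers store index tuples in pages. */
typedef struct
{
    int     items[TOY_BUFFER_CAPACITY];
    size_t  head;   /* index of the oldest item still in the buffer */
    size_t  tail;   /* index one past the newest item */
} ToyBuffer;

static void
toy_init(ToyBuffer *buf)
{
    buf->head = 0;
    buf->tail = 0;
}

static void
toy_push(ToyBuffer *buf, int item)
{
    assert(buf->tail < TOY_BUFFER_CAPACITY);
    buf->items[buf->tail++] = item;
}

/* LIFO: the newest item leaves first. Returns 0 if the buffer is empty. */
static int
toy_pop_lifo(ToyBuffer *buf, int *item)
{
    if (buf->head == buf->tail)
        return 0;
    *item = buf->items[--buf->tail];
    return 1;
}

/* FIFO: the oldest item leaves first. Returns 0 if the buffer is empty. */
static int
toy_pop_fifo(ToyBuffer *buf, int *item)
{
    if (buf->head == buf->tail)
        return 0;
    *item = buf->items[buf->head++];
    return 1;
}
```

With ordered input pushed as 1, 2, 3, the LIFO pop hands the tuples down in reverse (3 first), while the FIFO pop preserves the input order (1 first).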

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#132Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#130)
1 attachment(s)
Re: WIP: Fast GiST index build

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.14.3.patch.gz (application/x-gzip)
#133Stefan Keller
sfkeller@gmail.com
In reply to: Alexander Korotkov (#132)
Re: WIP: Fast GiST index build

Hi,

Unlogged tables seem to me to pursue a similar goal. Obviously GiST
indexes are not supported there.
Do you know the technical reason?
Do you see some synergy in your work on fast GiST index building and
unlogged tables?

Yours, Stefan

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.
------
With best regards,
Alexander Korotkov.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#134Alexander Korotkov
aekorotkov@gmail.com
In reply to: Stefan Keller (#133)
Re: WIP: Fast GiST index build

Hi!

Unlogged tables seem to me to pursue a similar goal. Obviously GiST
indexes are not supported there.
Do you know the technical reason?

GiST uses serial numbers of operations for concurrency control. In the
current implementation, xlog record IDs serve as those numbers. For an
unlogged table no xlog records are produced, so we have no serial numbers
of operations. AFAIK, it's enough to provide some other source of serial
numbers in order to make GiST work with unlogged tables.
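
As a rough illustration of what "some other source of serial numbers" could look like, here is a minimal sketch under stated assumptions: ToySerial, toy_next_serial and ToyPage are hypothetical names, and a plain in-memory counter stands in for xlog record IDs. It is not how PostgreSQL implements this. The only property relied on is that the numbers increase monotonically, so a page split can be recognized as newer than the moment its parent was read.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ToySerial;

/* Hypothetical counter standing in for xlog record IDs. */
static ToySerial toy_counter = 1;

/* Hand out the next serial number; values only ever increase. */
static ToySerial
toy_next_serial(void)
{
    return toy_counter++;
}

/* Hypothetical page: remembers the serial number of its latest split. */
typedef struct
{
    ToySerial   nsn;
} ToyPage;

/* Stamp the page with a fresh serial number when it is split. */
static void
toy_split_page(ToyPage *page)
{
    page->nsn = toy_next_serial();
}

/*
 * Was the page split after the reader looked at its parent? If so, the
 * reader has to visit the page's right sibling as well.
 */
static int
toy_split_since(const ToyPage *page, ToySerial seen_at_parent)
{
    return page->nsn > seen_at_parent;
}
```

A reader that took serial number N while looking at the parent sees no split on an untouched child, and detects the split once the child has been stamped with a number greater than N.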

Do you see some synergy in your work on fast GiST index building and
unlogged tables?

With the technique discussed in this thread, a GiST build can gain from
unlogged tables as much as a regular build does.

------
With best regards,
Alexander Korotkov.

#135Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#132)
Re: WIP: Fast GiST index build

On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.

I've committed the patch now, including that fix. Thanks for a great
GSoC project!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#136Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#135)
Re: WIP: Fast GiST index build

On Thu, Sep 8, 2011 at 10:59 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.

I've committed the patch now, including that fix. Thanks for a great GSoC
project!

Wow, major congrats, Alexander!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#137Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#132)
Fast GiST index build - further improvements

Now that the main GiST index build patch has been committed, there are a
few further improvements that could make it much faster still:

Better management of the buffer pages on disk. At the moment, the
temporary file is used as a heap of pages belonging to all the buffers
in random order. I think we could speed up the algorithm considerably by
reading/writing the buffer pages sequentially. For example, when an
internal page is split, and all the tuples in its buffer are relocated,
that would be a great chance to write the new pages of the new buffers
in sequential order, instead of writing them back to the pages freed up
by the original buffer, which can be spread all around the temp file. I
wonder if we could use a separate file for each buffer? Or at least, a
separate file for all buffers that are larger than, say 100 MB in size.
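
The scattering can be illustrated with the free-block bookkeeping alone. This is a minimal sketch, not the patch's code: ToyTempFile and the toy_* functions are hypothetical, loosely modelled on the nFileBlocks, freeBlocks and nFreeBlocks fields of GISTBuildBuffers, with a fixed-size array in place of the resizable one.

```c
#include <assert.h>

#define TOY_MAX_FREE 32

/* Hypothetical model of the temporary file's block allocator. */
typedef struct
{
    long    nFileBlocks;                /* blocks ever handed out by the file */
    long    freeBlocks[TOY_MAX_FREE];   /* stack of recycled block numbers */
    int     nFreeBlocks;                /* entries currently on the stack */
} ToyTempFile;

/* Reuse a recycled block if there is one, otherwise extend the file. */
static long
toy_get_free_block(ToyTempFile *file)
{
    if (file->nFreeBlocks > 0)
        return file->freeBlocks[--file->nFreeBlocks];
    return file->nFileBlocks++;
}

/* Give a block back so that a later allocation can reuse it. */
static void
toy_release_block(ToyTempFile *file, long blkno)
{
    assert(file->nFreeBlocks < TOY_MAX_FREE);
    file->freeBlocks[file->nFreeBlocks++] = blkno;
}
```

After blocks 0..4 have been allocated and blocks 1 and 3 released, the next three allocations return 3, 1 and 5: consecutive pages of a new buffer end up in non-adjacent places in the file, which is the random access pattern the paragraph above wants to avoid.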

Better management of in-memory buffer pages. When we start emptying a
buffer, we currently flush all the buffer pages in memory to the
temporary file, to make room for new buffer pages. But that's a waste of
time, if some of the pages we had in memory belonged to the buffer we're
about to empty next, or that we empty tuples to. Also, if while emptying
a buffer, all the tuples go to just one or two lower level buffers, it
would be beneficial to keep more than one page in-memory for those buffers.

Buffering leaf pages. This I posted on a separate thread already:
http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com

Also, at the moment there is one issue with the algorithm that we have
glossed over thus far: For each buffer, we keep some information in
memory, in the hash table, and in the auxiliary lists. That means that
the amount of memory needed for the build scales with the size of the
index. If you're dealing with very large indexes, hopefully you also
have a lot of RAM in your system, so I don't think this is a problem in
practice. Still, it would be nice to do something about that. A
straightforward idea would be to swap some of the information to disk.
Another idea, simpler to implement, would be to completely destroy
a buffer, freeing all the memory it uses, when it becomes completely
empty. Then, if you're about to run out of memory (as defined by
maintenance_work_mem), you can empty some low level buffers to disk to
free up some memory.
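
The "destroy a buffer when it becomes empty" idea can be sketched as follows. This is only an illustration under stated assumptions, not a proposal for the actual data structures: ToyBuildBuffers, ToyNodeBuffer and the toy_bb_* functions are hypothetical, a small fixed array stands in for the hash table, and a tuple counter stands in for the buffered tuples.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_NODES 8

/* Hypothetical per-buffer metadata kept in memory. */
typedef struct
{
    int     ntuples;    /* tuples currently held by this buffer */
} ToyNodeBuffer;

/* Hypothetical registry of buffers, indexed by node number. */
typedef struct
{
    ToyNodeBuffer *buffers[TOY_MAX_NODES];  /* NULL: no metadata in memory */
    int     nallocated;                     /* buffers with live metadata */
} ToyBuildBuffers;

/* Add a tuple to a node's buffer, creating the metadata on first use. */
static void
toy_bb_push(ToyBuildBuffers *bb, int node)
{
    assert(node >= 0 && node < TOY_MAX_NODES);
    if (bb->buffers[node] == NULL)
    {
        bb->buffers[node] = calloc(1, sizeof(ToyNodeBuffer));
        assert(bb->buffers[node] != NULL);
        bb->nallocated++;
    }
    bb->buffers[node]->ntuples++;
}

/*
 * Remove a tuple from a node's buffer. When the buffer becomes empty its
 * metadata is destroyed, so memory use follows the number of non-empty
 * buffers. Returns 0 if the node has no buffer.
 */
static int
toy_bb_pop(ToyBuildBuffers *bb, int node)
{
    assert(node >= 0 && node < TOY_MAX_NODES);
    if (bb->buffers[node] == NULL)
        return 0;
    if (--bb->buffers[node]->ntuples == 0)
    {
        free(bb->buffers[node]);
        bb->buffers[node] = NULL;
        bb->nallocated--;
    }
    return 1;
}
```

In this model the memory footprint is proportional to nallocated, the number of non-empty buffers, rather than to the number of buffers ever created.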

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#138Oleg Bartunov
oleg@sai.msu.su
In reply to: Heikki Linnakangas (#135)
Re: WIP: Fast GiST index build

My congratulations too, Alexander! Hope to work on SP-GiST together!

Oleg

On Thu, 8 Sep 2011, Heikki Linnakangas wrote:

On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.

I've committed the patch now, including that fix. Thanks for a great GSoC
project!

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#139Alexander Korotkov
aekorotkov@gmail.com
In reply to: Oleg Bartunov (#138)
Re: WIP: Fast GiST index build

Thanks for the congratulations!
Thanks to Heikki for mentoring and for his work on the patch!

------
With best regards,
Alexander Korotkov.

#140Stefan Keller
sfkeller@gmail.com
In reply to: Alexander Korotkov (#134)
Re: WIP: Fast GiST index build

Robert,

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:

GiST uses serial numbers of operations for concurrency control. In the
current implementation, xlog record IDs serve as those numbers. For an
unlogged table no xlog records are produced, so we have no serial numbers
of operations. AFAIK, it's enough to provide some other source of serial
numbers in order to make GiST work with unlogged tables.

GiST is IMHO quite broadly used. I use it, for example, for indexing
geometry and hstore types, and there's no other choice there.
Do you know whether the UNLOGGED option of CREATE TABLE will support GiST
in the next release?

Stefan

#141Stefan Keller
sfkeller@gmail.com
In reply to: Stefan Keller (#140)
Re: WIP: Fast GiST index build

I'm about to open a ticket for hash indexes (adding WAL support) anyway.
May I open a ticket for adding GiST support to unlogged tables?

Stefan

2011/9/14 Stefan Keller <sfkeller@gmail.com>:

Robert,

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:

GiST uses serial numbers of operations for concurrency control. In the
current implementation, xlog record IDs serve as those numbers. For an
unlogged table no xlog records are produced, so we have no serial numbers
of operations. AFAIK, it's enough to provide some other source of serial
numbers in order to make GiST work with unlogged tables.

GiST is IMHO quite broadly used. I use it, for example, for indexing
geometry and hstore types, and there's no other choice there.
Do you know whether the UNLOGGED option of CREATE TABLE will support GiST
in the next release?

Stefan

#142Robert Haas
robertmhaas@gmail.com
In reply to: Stefan Keller (#140)
Re: WIP: Fast GiST index build

On Tue, Sep 13, 2011 at 5:00 PM, Stefan Keller <sfkeller@gmail.com> wrote:

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:

GiST uses serial numbers of operations for concurrency control. In the
current implementation, xlog record IDs serve as those numbers. For an
unlogged table no xlog records are produced, so we have no serial numbers
of operations. AFAIK, it's enough to provide some other source of serial
numbers in order to make GiST work with unlogged tables.

GiST is IMHO quite broadly used. I use it, for example, for indexing
geometry and hstore types, and there's no other choice there.
Do you know whether the UNLOGGED option of CREATE TABLE will support GiST
in the next release?

It's probably not a difficult patch to write, but I don't have any
current plans to work on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#143Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#137)
Re: Fast GiST index build - further improvements

On Thu, Sep 8, 2011 at 8:35 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Now that the main GiST index build patch has been committed, there are a few
further improvements that could make it much faster still:

Better management of the buffer pages on disk. At the moment, the
temporary file is used as a heap of pages belonging to all the buffers in
random order. I think we could speed up the algorithm considerably by
reading/writing the buffer pages sequentially. For example, when an
internal page is split, and all the tuples in its buffer are relocated,
that would be a great chance to write the new pages of the new buffers in
sequential order, instead of writing them back to the pages freed up by the
original buffer, which can be spread all around the temp file. I wonder if
we could use a separate file for each buffer? Or at least, a separate file
for all buffers that are larger than, say 100 MB in size.

Better management of in-memory buffer pages. When we start emptying a
buffer, we currently flush all the buffer pages in memory to the temporary
file, to make room for new buffer pages. But that's a waste of time, if
some of the pages we had in memory belonged to the buffer we're about to
empty next, or that we empty tuples to. Also, if while emptying a buffer,
all the tuples go to just one or two lower level buffers, it would be
beneficial to keep more than one page in-memory for those buffers.

Buffering leaf pages. This I posted on a separate thread already:
http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com

Also, at the moment there is one issue with the algorithm that we have
glossed over thus far: For each buffer, we keep some information in memory,
in the hash table, and in the auxiliary lists. That means that the amount
of memory needed for the build scales with the size of the index. If you're
dealing with very large indexes, hopefully you also have a lot of RAM in
your system, so I don't think this is a problem in practice. Still, it
would be nice to do something about that. A straightforward idea would be
to swap some of the information to disk. Another idea, simpler to
implement, would be to completely destroy a buffer, freeing all the memory
it uses, when it becomes completely empty. Then, if you're about to run out
of memory (as defined by maintenance_work_mem), you can empty some low
level buffers to disk to free up some memory.

Unfortunately, I don't have enough time to implement any of this before
the 9.2 release. Work on my PhD thesis and web projects takes too much
time.

But I think there is one thing we should fix before the 9.2 release. We
assume that a GiST index build has at least effective_cache_size/4 of cache
available. This assumption could easily be false on high-concurrency
systems. I don't see a way to get a convincing estimate here, but we could
document this behaviour, so that users can tune effective_cache_size for
GiST index builds under high concurrency.
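
To show why the effective_cache_size/4 assumption matters, here is a minimal sketch of how such a budget could translate into a level step. It is a simplification under stated assumptions, not the code in the tree: toy_level_step is a hypothetical function, the cache size is given in pages, and every page is assumed to have the same fanout. The step grows for as long as a subtree of that many levels below a buffered node still fits into a quarter of the assumed cache.

```c
#include <assert.h>

/*
 * Hypothetical helper: return the largest level step for which a subtree
 * spanning that many levels below its top page fits into cache_pages / 4.
 * A result of 0 means that not even one extra level fits.
 */
static int
toy_level_step(long cache_pages, long fanout)
{
    long    budget = cache_pages / 4;
    long    level_pages = 1;    /* pages on the lowest level of the subtree */
    long    subtree_pages = 1;  /* total pages, starting with the top page */
    int     step = 0;

    assert(cache_pages > 0 && fanout > 1);

    for (;;)
    {
        /* Stop before the next level alone would exceed the budget. */
        if (level_pages > budget / fanout)
            break;
        level_pages *= fanout;
        if (subtree_pages + level_pages > budget)
            break;
        subtree_pages += level_pages;
        step++;
    }
    return step;
}
```

With a fanout of 100, an assumed cache of 4,000,000 pages allows a step of 2, while 16,384 pages allow only a step of 1. If concurrent activity leaves the build far less cache than the setting suggests, a step chosen from the setting is too optimistic, which is why lowering effective_cache_size for the build can help.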

------
With best regards,
Alexander Korotkov.