Switching timeline over streaming replication

Started by Heikki Linnakangas over 13 years ago, 86 messages
#1 Heikki Linnakangas
hlinnaka@iki.fi
1 attachment(s)

I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby server
that you'd like to keep following the new master server, you need a WAL
archive in addition to streaming replication to make it cross the
timeline change. Streaming replication will just error out. Having a WAL
archive is usually a good idea in complex replication scenarios anyway,
but it would be good to not require it.

Attached is a WIP patch for that. It needs cleanup, but works.

Protocol changes
----------------

When we invented the copy-both mode, we left out any description of how
to get out of that mode, simply stating that both ends "may then send
CopyData messages until the connection is terminated". The patch makes
it possible to revert back to regular processing, by sending a CopyDone
message, like in normal copy-in or copy-out mode. Either end can take
the initiative and send CopyDone, and after doing that may not send any
more CopyData messages. When both ends have sent a CopyDone message, and
received a CopyDone message from the other end, the connection is out of
copy mode, and the server finishes the command with a CommandComplete
message.

Another way to think of it is that when the server sends a CopyDone
message, the connection switches from copy-both to copy-in mode. And if
the client sends a CopyDone message first, the connection goes from
copy-both to copy-out mode, until the server ends the streaming from its
end.
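
The mode transitions described above can be written down as a small
state machine. This is only an illustrative sketch, not code from the
patch: the names CopyState and on_copy_done are made up for the example,
and it models nothing but the CopyDone handshake.

```python
from enum import Enum


class CopyState(Enum):
    """Connection modes during the CopyDone handshake (names are illustrative)."""
    COPY_BOTH = "copy-both"  # both ends may send CopyData
    COPY_IN = "copy-in"      # server sent CopyDone; only the client still sends
    COPY_OUT = "copy-out"    # client sent CopyDone; only the server still sends
    DONE = "done"            # both sent CopyDone; server sends CommandComplete


def on_copy_done(state: CopyState, sender: str) -> CopyState:
    """Return the new mode after `sender` ("server" or "client") sends CopyDone."""
    if sender not in ("server", "client"):
        raise ValueError(f"unknown sender: {sender!r}")
    if state is CopyState.COPY_BOTH:
        # Whoever finishes first narrows the connection to one direction.
        return CopyState.COPY_IN if sender == "server" else CopyState.COPY_OUT
    if state is CopyState.COPY_IN and sender == "client":
        return CopyState.DONE
    if state is CopyState.COPY_OUT and sender == "server":
        return CopyState.DONE
    # An end that already sent CopyDone must not send it again.
    raise ValueError(f"{sender} may not send CopyDone in {state.value} mode")
```

Either order of the two CopyDone messages ends in the same DONE state,
which is where the server would issue CommandComplete.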

New replication command: TIMELINE_HISTORY
-----------------------------------------

To switch recovery target timeline, a standby needs the timeline history
file (e.g. 00000002.history) of the new timeline. The patch adds a new
command to the set of commands accepted by walsender, to transmit a
given timeline history file from master to slave.
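
As a small illustration of the file naming used in the example above
(00000002.history for timeline 2), the name is the timeline ID written
as eight hexadecimal digits followed by ".history". The helper name
below is made up for the example.

```python
def timeline_history_filename(tli: int) -> str:
    """Build the history file name for a timeline ID (hypothetical helper)."""
    if not 1 <= tli <= 0xFFFFFFFF:
        raise ValueError(f"timeline ID out of range: {tli}")
    # Eight upper-case hex digits, zero padded, then the ".history" suffix.
    return f"{tli:08X}.history"
```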

Walsender changes to stream a particular timeline
-------------------------------------------------

The walsender now keeps track of exactly which timeline it is currently
streaming; it's not necessarily the latest one anymore. The
START_REPLICATION command is extended with a TIMELINE option that the
client can use to request streaming from a particular timeline. If the
client asks for a timeline that's not the current one, but is part of the
history of the server, the walsender knows to read from the WAL
file that contains it. Also, the walsender knows where the server's
history branched off from that timeline, and will only stream WAL up to
that point. When that point is reached, it ends the streaming (with a
CopyDone message), and prepares to accept a new replication command.
Typically, the walreceiver will then ask to start streaming from the
next timeline.
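
A rough sketch of the decision the walsender makes for a requested
timeline is shown below. It is not taken from the patch: the history is
modelled as a plain list of (timeline ID, switchpoint) pairs, with None
as the switchpoint of the server's current timeline, and the function
names are invented for the example.

```python
from typing import List, Optional, Tuple

# Oldest timeline first. The second field is the WAL position where the
# server's history branched off from that timeline, or None for the
# timeline the server is currently on. (Illustrative representation.)
History = List[Tuple[int, Optional[int]]]


def stream_end_for_timeline(history: History, requested_tli: int) -> Optional[int]:
    """Return the position where streaming of requested_tli must stop.

    None means "no end point": the timeline is the server's current one.
    """
    for tli, switchpoint in history:
        if tli == requested_tli:
            return switchpoint
    raise ValueError(f"timeline {requested_tli} is not part of this server's history")


def next_timeline(history: History, tli: int) -> Optional[int]:
    """Return the timeline a client would typically ask for next, if any."""
    ids = [t for t, _ in history]
    i = ids.index(tli)  # raises ValueError for an unknown timeline
    return ids[i + 1] if i + 1 < len(ids) else None
```

For a server whose history is [(1, 0x3000000), (2, 0x5000000), (3, None)],
a request for timeline 1 is streamed up to 0x3000000, after which the
client would normally continue with timeline 2.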

Walreceiver changes
-------------------

Previously, when the timeline reported by the server didn't match the
current timeline in the standby, walreceiver simply errored out. Now, it
requests any missing timeline history files using the new
TIMELINE_HISTORY command, and then tries to start replication from the
standby's current timeline, even if that's older than the master's.

When the end of the old timeline is reached, walreceiver sets a state
variable in shared memory to indicate that, pings the startup
process, and waits for new orders from the startup process. The startup
process can set receiveStart and timeline in shared memory and ping the
walreceiver again, to get the walreceiver to restart streaming from the
new starting point [1]. Before the startup process does that, it will
scan pg_xlog for new timeline history files if
recovery_target_timeline='latest'. It will find any new history files
the walreceiver stored there, and switch over to the latest timeline
just as it does with a WAL archive.
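
Seen from the standby, the overall sequence could be sketched roughly as
follows. This is a simplification under stated assumptions, not the
patch's logic: the step names "fetch_history" and "stream" are invented,
the server's timelines are given as a simple ordered list, and the
hand-off between walreceiver and the startup process at each timeline
switch is left out.

```python
from typing import List, Set, Tuple


def plan_catchup(standby_tli: int,
                 server_timelines: List[int],
                 local_history_files: Set[int]) -> List[Tuple[str, int]]:
    """List the steps a standby would take to follow the server (illustrative)."""
    if standby_tli not in server_timelines:
        raise ValueError(f"timeline {standby_tli} is not in the server's history")
    start = server_timelines.index(standby_tli)
    steps: List[Tuple[str, int]] = []
    # First fetch the history files that are missing locally. Timeline 1
    # has no parent and therefore no history file.
    for tli in server_timelines[start:]:
        if tli > 1 and tli not in local_history_files:
            steps.append(("fetch_history", tli))
    # Then stream each timeline in turn, starting from the standby's own,
    # even if that is older than the server's latest.
    for tli in server_timelines[start:]:
        steps.append(("stream", tli))
    return steps
```

A standby on timeline 2 that already has 00000002.history, following a
server on timeline 3, would fetch the history file for timeline 3 and
then stream timeline 2 followed by timeline 3.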

Some parts of this patch are just refactoring that probably make sense
regardless of the new functionality. For example, I split off the
timeline history file related functions to a new file, timeline.c.
That's not very much code, but it's fairly isolated, and xlog.c is
massive, so I feel that anything that we can move off from xlog.c is a
good thing. I also moved off the two functions RestoreArchivedFile() and
ExecuteRecoveryCommand(), to a separate file. Those are also not much
code, but are fairly isolated. If no-one objects to those changes, and
the general direction this work is going in, I'm going to split off those
refactorings to separate patches and commit them separately.

I also made the timeline history file a bit more detailed: instead of
recording just the WAL segment where the timeline was changed, it now
records the exact XLogRecPtr. That was required for the walsender to
know the switchpoint, without having to parse the XLOG records (it reads
and parses the history file, instead).
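
To illustrate what reading an exact switchpoint from a history file
might look like, here is a small parsing sketch. The line layout is an
assumption made for the example (parent timeline ID, then the position
in the usual "hi/lo" hexadecimal notation for WAL positions, then an
optional free-text reason); the layout actually written by the patch may
differ, and the function names are invented.

```python
from typing import Tuple


def parse_lsn(text: str) -> int:
    """Convert a WAL position written as "hi/lo" in hex to a 64-bit integer."""
    hi, lo = text.split("/")  # raises ValueError unless exactly one "/"
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history_line(line: str) -> Tuple[int, int]:
    """Parse one history line (assumed layout) into (timeline ID, switchpoint)."""
    fields = line.split(None, 2)  # timeline, position, optional reason text
    if len(fields) < 2:
        raise ValueError(f"malformed history line: {line!r}")
    return int(fields[0]), parse_lsn(fields[1])
```

With an exact position like this available, the switchpoint can be
compared directly against the position being streamed, instead of only
knowing which WAL segment the switch happened in.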

[1]: Initially, I tried to do this by simply letting walreceiver die and
have the startup process launch a new walreceiver process that would
reconnect, but it turned out to be hard to rapidly disconnect and
connect, because the postmaster, which forks the walreceiver process,
does not always have the same idea of whether the walreceiver is active
as the startup process does. It would eventually be ok, thanks to
timeouts, but would require polling. But not having to disconnect seems
nicer, anyway.

- Heikki

Attachments:

streaming-tli-switch-1.patch.gz (application/x-gzip)
�*l#��Jj��+�^�Sb�F�_�|��o��6���~�
n���X�5�-���+@GyR����c79��;��������y��P���o�@6t���usl1v��d�s����X�(F�����@~�F��2�wi5����E'���q�J@TB�aj�G�#���w�l\���b��+-/���wz|I���O��\v�\e�M0:��Zcs��c���`Q�G�9c��s��0;5\3��)�X~�Du��d��%���?����}xo��:S!DB�W�������IS��3su��,	y0%#�4F��i��M$&A��.�c�!�����8�kD�T�9��+�������@J��m�>�s��_��@�9����4Suc��zQ��B��vT4�E�Z�D;���Q�=�����U�V�r�������8&���g�{m��q7�Vs6T	�\Y�4���4����h���x��3��������q��j�Y���Jb��������J�$����d[z"{���kqg��g���ob� �m���w�C�Y���)Dh���%=l��?�-��U���I�>G9����a�N���K��%�+`����4L[���B�������K�&�.��s# ���4��!�MZ}x�a��2N�����������4��|F$MG�Z.?���f���||S�� [�o��Z\.m��v����4������+�`\�ea�k�I��m���,�m��RE���tJF`�s#~����?�
o85�,�]�sP���[�����zu-�l�-����p�V�P
s���s����nkM�����ZD��h}r.aZh���E�Dx!����1M������N����j����'n�x�=�M�e{Z��es������>�L����H-���df����
K���!\`&��	za�L�N�<��
�!�G���v�7n-��4�Rh]���1<���h��G��x���OK���z���jf����@�
?��o��G��$N��7�8��;�	�x�~�?)���o���*�b�x$	����4+.�+@��Wz,�D��2W�c�,B&"��B����$4@����D{���m�l��Z[��$��k�\;���\I��%���h��Z;L�Id�����<a�=<Q\�k�#�#����; �A�s���T�J2�1��Z��F���������R/��9 
3-�6N�A�}J����)��lAr��L��?�+jK��������rFo������A+i5(��ZB�����sO-�ws�6q�'����]��]cMlZ����YX�:�"��3oq�W�a��|�b����W�q���K��a;�������/q�O}��N�Q� �X����]�������W���z'���[Lt�I$���8u���:���w�h&iEI�1n���ex���`;I��m��9D7����>���fi�v�a�$�K��#m1p��%��aOJ� �	�v��JAh���#wo�^�A����0\�;(3Pq�Y�+�*�c]����N�66��X�]��O/����uwKs���|E��0p�v���O)��z{w�sj�3�hUP���t��I�-Y��?���x�*
g�h�v-}��i��G��VI4�`\c��F�L�W�&�p�W���<v\��� k<n�w����_��J������$�;��1vd��su�#�]�1l�O��Q��{&nZ��v������dI��3C^F���9#����Q8���NG����xZ<�p<z���<">�^: P�.��<�s�-^��z,Qq�`*�P�
��������R��,Z�r���9l����a@�	jfe����2���6Wt���?�=E��d�������/����i�
�R5���#A�6����V����jS
������0U(&�~=
,I���sJ:�I����TB:���"Tb�O�1���Is7#f�>e
�|N�M�\��o��EFW�K��!bU�������-?4���M`j��A�3���N*��K�����*3R�1����`�4������,Txs�
��*���#��)�)<1������7f�01�S�W�kM����k0��v�:0]P�Q���
���z�����#���������.���=��6�e~�(L3������r0�)~��g�����K��8�0���
���[���w^&�������	����}�m:�u��'EB���R���
-�WG�;�9{�E�yL����tsw���h��.��H��{���}�DT����������,h���1�G��"�����`xg�E��(�.�-����"�C�t6%��RP_���D`��!E�E2��,��8@�D�'�-&�����{��Snl� U�Q��%�q�1t���OL��V��pg��$d��Q:s�vW�0�7,���H��v]�9�	�}<��������%����l�������}�p��w������p�[����A|���E���l"�D����'V��QR@��bJ_n���r����C~���s������-����#[N���t����^�5p�`��DO�g��E
S����b��1�Ky��[2��E���t�_�t?m ����SL����������e2dR����>��b�����i�����uz3��5��1�fV��k�3������+Pc�[Q�~9�
6J'��Qr&���
��4z\)�8T�S4qM�����l"U���HX�-rk��Xc������(�����
�|��`%�3�4��Sn�Nd�h5\U�G� �&��+���~p�9�������>��!�8UC��(z3����'�������2�y���Bo�E�I#mRT�=�<_IV�U�v�������f
)(�R�mg��{0"S��@4|r$f��_���%�.����\7�M��=�E��c1V-?��w�����RA�:�6�a2��#{�0��w�x����������������r��W,�|\�J�~X�����[�A�V�~8o�������(���JR��-G�R����t+8��-c�Y���^�+�k�sPZY�l$�)��l�s-�W�}88<����9:y�w�	����fT�5�1��k�C��)PjD��VX�O�Z���-`�_�����o��������� �sb�����\+�G��{�����]5�������^��8
�L��Q��M6��o�;�sW�L����;
�m$g~�A"`����{�M`��I�V�2��	�u#�b^�=�H�@X��FW�����!����aweA��~O/������������O9�q5;Si�K�>�TB��?
=])����P��1���'d�J�%o�5z gW�9�����S���HNV]�hik���h7/���bL���Fd"�3?�������4�$�N�k`���VD�5l�0�����=���[V�O������x�5a26����C�]��B�mDM,k��2�������5�W��|qJ�<QgQYT;O���UE������^��1CYu��67�%���������G������e!?����n�H�P�m���og�{�C'������������������0��%#��4��W��g���T5t�������R$$G�fRX.S�������w�����v^���
&pkk��e�'���N��S���I>���������/O����1��yYy�kn����t>�#��'�|�&GyRsN{���W����0�P#��D���eTx1R�'�������Fq�]�zi�����%���0E:�}�a���p*.����l�3����Yi�pC&o|����� r���\fS�)�|
���3`�BM�\5cq�o�0I���"��n�G+[���M��F�������D�L�h���O��^�����7��&���5���I�B�Ih����J��0Y|�Xw�@�������p�F��Cq�������G�J
��B����P�(��U�����\���pC����}L�\q�m�*�>J&MNy����l��4��~<fV��'����3<��S�!&��m��U�������b:����eA0�,�9��?X
D��q�MJ8V~���P|�\^8l�6[��r�A$����`�<>x�a����g����:~��)��\(�b`�H����gnK��7=!�=d�2����i��pN�>K�[V�f,+�![���m����U��m�������gt-!�G���QW	����
�qP��
�����0�B�r@��`�/�Re�
�#�0q��'�Sf�N�m�i#`&�����p�\���W�Z�y����|��(���+]
..��fQo��Zbkd��#�npM�@��������$3pG���k`n.i��� .��xi����)��=zUf���+�
����WM
Z���m�n��SXN��#�!F&�eVMBS�j%�;�3J �H�+h�l3*;=F�i|�]��>���oXY"?P;�����0��&���6?���8��y����9l>��nz�c������S^�]�X�ps
��X?U�x6�LOl����)N���G$J�WL��C]T��K�����j`��n�T�n���I{��o�'7������1�b�a�T�
d�Q����+�#�v�����::P�l��5�V��P���bw�p�
_<F+�U<|�V�g��U���9H�c����C�8���P�7�T������v�#�Tt�f����#%2"W��\��+$C�h����E��N�q�������3a�gc2�
���F3..���'�T��0O�rqQ1F�	k5PGP���}���5<���b�MxpCe0o����>��B�S�v {�JW��i�_@��]jG|�8��\�gGX�e�u�����cE�7/���I��a��f[�O�@7��6�9U���Uwd��UJ��(.�c�J����I��`n�I[
��&�����)��Td&�&���%���Y	q��Eu��S.��u��^���&�X��I��!����41�~�$'�]���:n��������C���1�'��V)��c1~�r@��b2�8���jr{���5��8����<��1s������W%���L����T����;fnm����.��ey��B�,����T�d	�W�SV�H�
�V������z^)>�S��F��ohq-�����"~<���n���7�e����6�}���@��&G�ZZH�:X����zg7e��]��+G�q<�����Oa>e������9�������,��a�>r�)a2��������H{.:7���D�O�6�����o%�k��|	?^��O�1��faw���F��]����<,--�
�[q+�����*������3��������p�l~Z���{���d0��
���7�`�<�f�$�:#�����g#$�#����c�H�c{P�
���8Q�7�g�	�en���|����;��!y� ���"��n����m���$1���%��C�.��E*2��M�������D[�F���"�E����)�<��$;���n���f������TW�DJ���
����@�]�L���K�J >�~A+:����u�!�����d3��pL-�
�(�s�\�*�FA&�i�wW�rY���N��r?�A��Cd9k��K,�������X#����z������ �����$�=�^O�Z9�Fj��k���hc
��'��VV2������c�o,J����d�|�������5h�Ah��$��Ra�84���H�g�������\��%�xC�)�����������p�����:z�>�R���'�3}��MD�U���,���mfz*�3��x�����i��;��o������OB��d��w���������-��c�����H�v6��[*�C�=�{Y=)�o�/~Q���"��Z����;�"N@u����s�- ��h���6�m��v����emIve�z\>O;i9x�nw�n�,!�w,F��1�z�6�1��zK)�|3����fM��=�S�����O�Xx��H�u���`����3����SJ������DE���k����U�i��sH�=+���#�OQ��|�}TwTav+
Th��d:goz��E����ys��i�]<�K����g���0�7�[�^�����U�V4�IB�q[���I����JA���T@f���gI��2�PM2��-/��;�]��w��A0��u�����_>�u�;�wJ4H�#���h(�E^���&��4��5������=��Gm{�&��d�IZG�����]q�/�$���%��~����^�����9��y���S�OQ.�� d�����?X�W�G�~�G����{/��sy��Uu��[x�X���%9��?�3��)(��dOF���xrp�����4���n�����p�3�
������Y�q�<��1/?�|�Tp��
�f^�����B�v����"�J�������w	�y.o[��O%�r�p�����Kjt��|f���w�c�|$��R.�WC`�0��B������WW~�
�m����5BK�������G�fr��&5�t�5u��&�FZY�*+�UJ��a=�����8t���
N���������]��������V��(P���e��E9��� ���,�{��Z�2�}Y;Kf����V=�F�(V�e~�u%d?���yB��xBw��Z���Y��%�����;����������x7�g�z2M��;M��&��������N�J�\��m�J����Q���I�m�����;g����N]L/���`��(���j,6sPM�	c�4
6�u�1�j��~U��U����[���]���������m�.F��]h�\%���:�YhX���M3�*H��������.F���h*I��pjy*�v��S)q9�@a4�Y�$�D�d��
��%�,l��i$[ ��(Ef���W�^C�uM��,P;Z�)KK#�k����@��+~�EG7����'����TcF������e�"�����W��G��8�-H��s�N}������0������X�����D��x6aq?���Bc�����6�:a>w�Nm�i>�+Z����;�3���n�D1�t�+4:�K����*gvv��H�q�L�
����<��q�#���Z�HC
�u�\Y�)s�$�_8�g&���P�|MR�D����jpA����������Sic����L��1�)�+�����5����f����*�X~�r!�3W1�C��>�x�������x�����m��C�5v@���g�3�8#�q���Dh,��CD�2�RA��8	v��5 �<j�a�$>H��F��9��VXA���%�n�w++��'�}���5������__���9c*FZ�p	��e�\���sTzU�������6{�?5��V����<�����N���K���p��XSj�"t��������f�sj?���]&n�9s
�Z����v��U#���}kC�@^��vi�/���>�U4���n����������9����Z��8�%���^ze����Q���7:�����c��YZ���i��i���;�'�l����u�E��U��L���.�Y��~�����b$p-���m':�G�4���G����w�XQ�hV�%��*=��}��	�o���E(�ip�5_��MB�w����C*���&�\����x	��_���\�
����	���qm��Uz�\D�|�����>
����(���^�'������F��l�\^����%�W�����
�s���e��S
`��
/������:.����0d����u/��|�T���AbQ`�f��
���$P�
�}��l���o�)E��T��K/x;�z��t���,��}����!�i]\24T���.,��+�$����_��������(7��^���h���.�Cu��F�
��m��&���2�*����elD�z�tJ[ib��'8ZI
�)aC\�[����[����Z#�_`����0=���khI��Ky;I1E���6�IrA�q:�0Bh�l0�!�S����� v��?E���A��o{�<���g%P���%3{7o�.C �!L�{���p��g$#DRR�����2y04��\"���2������C����UD���d2���s�DW�Y~i�T���Qvsj��x���[���ar�d�����d2u)|��`�|N�
R��c��T�C��&,_��S��,-��I��C�-��iJ�������P({0*[l��J����Y����8��4#$s7(���e����v�L(M�]�!�^|�(Bx�po�FRsY�)_����������y������\[-�����JJ�/H`z��P@��y��w�����
9)S�9�>
��X1=��H0��d� �����zXY�����������{	tUE�����=���=\�$�m��w�s���>��'�I|�d�ye(	�0��qm�frmK2���3:���9�����/��3����xn�HG���a�@Sh����t�SC����������� ��ju�������	%�Sx#1w�ul���Lg���2f�]g��?
��Q�%t�����
�<?G������>���:]<
H,P�����u�d�����p��9�/N��p��wr�7�w�vo����\��w���wX�x���~��9gV�z�;0A/��8�/���uB��WW�="�QI|�wY�A#0����=���D'{����GGg?��D�g���m2��3oB�{�����x}��j���v�v�o�T�B���24�d���%�;����%�����(KKX�y��r��Q�d�0��kY�D<��7K�C&ri��(G�����O�|a�I�������\����}1/������[��\N9�D��H��dCm��o��`C�o�7�w��7TuK��U]��-��oCFU�r���T���|���x�j��l<�^�/p}�!m�au}�S.Z��&� !l���.�-
d
��IZ�(�@Gh�sI�\�3�����;NQug��wU�X]���E7�`�����W6�b�K�x���ip�8(0�;���B.R5���Q���R8mPG>�N���9@�t�G��Wi�V)�c-��"�U��i�����#+��`��,���msG:��64���������s���.
<8��nW.���NdD�3�4���0�B����{���!����[>�C(�6m������h�V���n!�C0�Qs���\�M�1\�3�e��W�!Fp���SLIon�����G�nf�kK�Cr��@)��B4w��k�� }�Kq��=uE]�$������&K�s�M����2��Q�u;��4���*t��m�I�����)+�&a��P�v�12�������12������;x|tr�T�=#�����8��1�3��S�$��.{HR��u���O���z��q���?���b��GtD�qY[��[%U�#��c������=3��'�K��'9�GI��}��2�`��d7=OUc�����-i�����N�8��d����B���Wq������l3�gD:�GL��z���k�{`�5f_�n]�V�EJ%d�'����Lk��L��Ur��b�/HDI(-s*\\�%+��~e�r���oJ��25���;��Q>I{�����	�W���soh�����c�k��*����{����u��]r��X���������
e�4���ixJw��~K�t]���"pmo��6�~��q��L�6���&��UY�utJ@?I�9�u<���f����%��
��TIW��$�c�"(Ez��G�,�����O1�3�2�/�������U�;�
q����G������������^���e��HS��@��P���\c�emy�2:��2�;�2M[[����%T��S��Nx
1j�;E���lz��\q�~�/?���y�D�(��������+��wk"���e���s�D��Z��n���s��)%����y���Z�N��������?e�$k3v6���?��y�O�/���"E��&����
s�����um�D����O]�l�w��&�Sf�=�<����ni^����k��}�j�/"78��[�L��.L0�
������q��t������\T
#2Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#1)
Re: Switching timeline over streaming replication

On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:

> I've been working on the often-requested feature to handle timeline
> changes over streaming replication. At the moment, if you kill the
> master and promote a standby server, and you have another standby
> server that you'd like to keep following the new master server, you
> need a WAL archive in addition to streaming replication to make it
> cross the timeline change. Streaming replication will just error out.
> Having a WAL archive is usually a good idea in complex replication
> scenarios anyway, but it would be good to not require it.

Please confirm my understanding of this feature:

This feature is for the case when standby-1, which is going to be promoted
to master, has archive mode 'on'.
As in that case only its timeline will change.

If the above is right, then there can be other similar scenarios where it
can be used:

Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it

Now in the above scenario also, it gives an error due to the timeline
mismatch, but your patch should fix it.

> Some parts of this patch are just refactoring that probably make sense
> regardless of the new functionality. For example, I split off the
> timeline history file related functions to a new file, timeline.c.
> That's not very much code, but it's fairly isolated, and xlog.c is
> massive, so I feel that anything that we can move off from xlog.c is a
> good thing. I also moved off the two functions RestoreArchivedFile()
> and ExecuteRecoveryCommand(), to a separate file. Those are also not
> much code, but are fairly isolated. If no-one objects to those changes,
> and the general direction this work is going to, I'm going split off
> those refactorings to separate patches and commit them separately.
>
> I also made the timeline history file a bit more detailed: instead of
> recording just the WAL segment where the timeline was changed, it now
> records the exact XLogRecPtr. That was required for the walsender to
> know the switchpoint, without having to parse the XLOG records (it
> reads and parses the history file, instead)

IMO separating the timeline history file related functions to a new file
is good.
However, I am not sure about splitting RestoreArchivedFile() and
ExecuteRecoveryCommand() into a separate file.
How about splitting off all the archive-related functions:
static void XLogArchiveNotify(const char *xlog);
static void XLogArchiveNotifySeg(XLogSegNo segno);
static bool XLogArchiveCheckDone(const char *xlog);
static bool XLogArchiveIsBusy(const char *xlog);
static void XLogArchiveCleanup(const char *xlog);
..
..

In any case, it will be better if you can split it into multiple patches:
1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.

It can make my testing and review of the new feature patch a little easier.

With Regards,
Amit Kapila.

#3Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#1)
Re: Switching timeline over streaming replication

On Monday, September 24, 2012 9:08 PM md@rpzdesign.com wrote:
> What a disaster waiting to happen. Maybe the only replication should be
> master-master replication so there is no need to sequence timelines or
> anything, all servers are ready masters, no backups or failovers.
> If you really do not want a master serving, then it should only be
> handled in the routing of traffic to that server and not the replication
> logic itself. The only thing that ever came about from failovers was the
> failure to turn over. The above is opinion only.

This feature is for users who want to use master-standby configurations.

What do you mean by:
"then it should only be handled in the routing of traffic to that server
and not the replication logic itself."

Do you have any idea other than the proposed implementation, or do you see
any problem in the currently proposed solution?


#4Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#2)
Re: Switching timeline over streaming replication

On 24.09.2012 16:33, Amit Kapila wrote:

> On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
>
> > I've been working on the often-requested feature to handle timeline
> > changes over streaming replication. At the moment, if you kill the
> > master and promote a standby server, and you have another standby
> > server that you'd like to keep following the new master server, you
> > need a WAL archive in addition to streaming replication to make it
> > cross the timeline change. Streaming replication will just error out.
> > Having a WAL archive is usually a good idea in complex replication
> > scenarios anyway, but it would be good to not require it.
>
> Please confirm my understanding of this feature:
>
> This feature is for the case when standby-1, which is going to be promoted
> to master, has archive mode 'on'.

No. This is for the case where there is no WAL archive.
archive_mode='off' on all servers.

Or to be precise, you can also have a WAL archive, but this patch
doesn't affect that in any way. This is strictly about streaming
replication.

> As in that case only its timeline will change.

The timeline changes whenever you promote a standby. It's not related to
whether you have a WAL archive or not.

> If the above is right, then there can be other similar scenarios where it
> can be used:
>
> Scenario-1 (1 Master, 1 Stand-by)
> 1. Master (archive_mode=on) goes down.
> 2. Master again comes up
> 3. Stand-by tries to follow it
>
> Now in the above scenario also, it gives an error due to the timeline
> mismatch, but your patch should fix it.

If the master simply crashes or is shut down, and then restarted, the
timeline doesn't change. The standby will reconnect / poll the archive,
and sync up just fine, even without this patch.

> However, I am not sure about splitting RestoreArchivedFile() and
> ExecuteRecoveryCommand() into a separate file.
> How about splitting off all the archive-related functions:
> static void XLogArchiveNotify(const char *xlog);
> static void XLogArchiveNotifySeg(XLogSegNo segno);
> static bool XLogArchiveCheckDone(const char *xlog);
> static bool XLogArchiveIsBusy(const char *xlog);
> static void XLogArchiveCleanup(const char *xlog);

Hmm, sounds reasonable.

> In any case, it will be better if you can split it into multiple patches:
> 1. Having new functionality of "Switching timeline over streaming
> replication"
> 2. Refactoring related changes.
>
> It can make my testing and review of the new feature patch a little easier.

Yep, I'll go ahead and split the patch. Thanks!

- Heikki

#5Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#4)
Re: Switching timeline over streaming replication

On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:

> On 24.09.2012 16:33, Amit Kapila wrote:
>
> > On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
> >
> > > I've been working on the often-requested feature to handle timeline
> > > changes over streaming replication. At the moment, if you kill the
> > > master and promote a standby server, and you have another standby
> > > server that you'd like to keep following the new master server, you
> > > need a WAL archive in addition to streaming replication to make it
> > > cross the timeline change. Streaming replication will just error out.
> > > Having a WAL archive is usually a good idea in complex replication
> > > scenarios anyway, but it would be good to not require it.
> >
> > Please confirm my understanding of this feature:
> >
> > This feature is for the case when standby-1, which is going to be promoted
> > to master, has archive mode 'on'.
>
> No. This is for the case where there is no WAL archive.
> archive_mode='off' on all servers.
>
> Or to be precise, you can also have a WAL archive, but this patch
> doesn't affect that in any way. This is strictly about streaming
> replication.
>
> > As in that case only its timeline will change.
>
> The timeline changes whenever you promote a standby. It's not related
> to whether you have a WAL archive or not.

Yes, that is correct. I thought a timeline change happens only when
somebody does PITR.
Can you please tell me why we change the timeline after promotion? The
original timeline concept was for PITR, and I am not able to trace from the
code why it is required on promotion.

> > If the above is right, then there can be other similar scenarios where it
> > can be used:
> >
> > Scenario-1 (1 Master, 1 Stand-by)
> > 1. Master (archive_mode=on) goes down.
> > 2. Master again comes up
> > 3. Stand-by tries to follow it
> >
> > Now in the above scenario also, it gives an error due to the timeline
> > mismatch, but your patch should fix it.
>
> If the master simply crashes or is shut down, and then restarted, the
> timeline doesn't change. The standby will reconnect / poll the archive,
> and sync up just fine, even without this patch.

How about when the master does PITR when it comes up again?

With Regards,
Amit Kapila.

#6Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#5)
Re: Switching timeline over streaming replication

On 25.09.2012 14:10, Amit Kapila wrote:

> On Tuesday, September 25, 2012 12:39 PM Heikki Linnakangas wrote:
>
> > On 24.09.2012 16:33, Amit Kapila wrote:
> >
> > > On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:
> > >
> > > > I've been working on the often-requested feature to handle timeline
> > > > changes over streaming replication. At the moment, if you kill the
> > > > master and promote a standby server, and you have another standby
> > > > server that you'd like to keep following the new master server, you
> > > > need a WAL archive in addition to streaming replication to make it
> > > > cross the timeline change. Streaming replication will just error out.
> > > > Having a WAL archive is usually a good idea in complex replication
> > > > scenarios anyway, but it would be good to not require it.
> > >
> > > Please confirm my understanding of this feature:
> > >
> > > This feature is for the case when standby-1, which is going to be
> > > promoted to master, has archive mode 'on'.
> >
> > No. This is for the case where there is no WAL archive.
> > archive_mode='off' on all servers.
> >
> > Or to be precise, you can also have a WAL archive, but this patch
> > doesn't affect that in any way. This is strictly about streaming
> > replication.
> >
> > > As in that case only its timeline will change.
> >
> > The timeline changes whenever you promote a standby. It's not related
> > to whether you have a WAL archive or not.
>
> Yes, that is correct. I thought a timeline change happens only when
> somebody does PITR.
> Can you please tell me why we change the timeline after promotion? The
> original timeline concept was for PITR, and I am not able to trace from the
> code why it is required on promotion.

Bumping the timeline helps to avoid confusion if, for example, the
master crashes, and the standby isn't fully in sync with it. In that
situation, there are some WAL records in the master that are not in the
standby, so promoting the standby is effectively the same as doing PITR.
If you promote the standby, and later try to turn the old master into a
standby server that connects to the new master, things will go wrong.
Assigning the new master a new timeline ID helps the system and the
administrator to notice that.

It's not bulletproof, for example you can easily avoid the timeline
change if you just remove recovery.conf and restart the server, but the
timelines help to manage such situations.
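
One concrete place where the timeline ID does this job is in file names: WAL segment files and timeline history files (such as 00000002.history) both start with the timeline ID as eight hexadecimal digits. The following is a minimal sketch of that naming scheme, not code from the patch. It assumes the default 16 MB segment size, and the helper names wal_file_name and history_file_name are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: default 16 MB WAL segments, i.e. 256 segments per 4 GB of WAL. */
#define SEGMENTS_PER_XLOGID 256u

/* Hypothetical helper: build the file name of WAL segment `segno` on timeline `tli`. */
static void
wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t segno)
{
    snprintf(buf, len, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / SEGMENTS_PER_XLOGID),
             (unsigned) (segno % SEGMENTS_PER_XLOGID));
}

/* Hypothetical helper: build the name of the history file for timeline `tli`. */
static void
history_file_name(char *buf, size_t len, uint32_t tli)
{
    snprintf(buf, len, "%08X.history", (unsigned) tli);
}
```

With these helpers, segment 3 on timeline 1 and segment 3 on timeline 2 get different file names, so WAL written by the old master after the point of divergence cannot silently stand in for WAL written by the promoted standby.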

> > > If the above is right, then there can be other similar scenarios where
> > > it can be used:
> > >
> > > Scenario-1 (1 Master, 1 Stand-by)
> > > 1. Master (archive_mode=on) goes down.
> > > 2. Master again comes up
> > > 3. Stand-by tries to follow it
> > >
> > > Now in the above scenario also, it gives an error due to the timeline
> > > mismatch, but your patch should fix it.
> >
> > If the master simply crashes or is shut down, and then restarted, the
> > timeline doesn't change. The standby will reconnect / poll the archive,
> > and sync up just fine, even without this patch.
>
> How about when the master does PITR when it comes up again?

Then the timeline will be bumped and this patch will be helpful.
Assuming the standby is behind the point in time that the master was
recovered to, it will be able to follow the master to the new timeline.
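
The condition above ("the standby is behind the point the master was recovered to") can be sketched as a comparison against the switchpoint recorded in the timeline history file. This is an illustration only, not the patch's code: the HistoryEntry struct, the function names, and the assumed line layout (parent timeline ID followed by the switchpoint written as two hexadecimal halves) are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;            /* WAL position as a 64-bit byte offset */

/* Hypothetical record describing where a parent timeline was forked. */
typedef struct
{
    uint32_t parent_tli;         /* timeline that the new timeline branched from */
    Lsn      switchpoint;        /* position at which the new timeline begins */
} HistoryEntry;

/* Parse a line of the assumed form "<parentTLI> <hi>/<lo>", e.g. "1 0/3000090". */
static bool
parse_history_line(const char *line, HistoryEntry *out)
{
    unsigned tli, hi, lo;

    if (sscanf(line, "%u %X/%X", &tli, &hi, &lo) != 3)
        return false;
    out->parent_tli = tli;
    out->switchpoint = ((Lsn) hi << 32) | lo;
    return true;
}

/*
 * A standby still on the parent timeline can follow the new timeline only
 * if it has not replayed WAL beyond the switchpoint.
 */
static bool
can_follow(uint32_t standby_tli, Lsn standby_lsn, const HistoryEntry *e)
{
    return standby_tli == e->parent_tli && standby_lsn <= e->switchpoint;
}
```

A standby that has already replayed past the switchpoint holds WAL that does not exist on the new timeline, so in this sketch can_follow() returns false for it.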

- Heikki

#7Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#4)
2 attachment(s)
Re: Switching timeline over streaming replication

On 25.09.2012 10:08, Heikki Linnakangas wrote:

> On 24.09.2012 16:33, Amit Kapila wrote:
>
> > In any case, it will be better if you can split it into multiple patches:
> > 1. Having new functionality of "Switching timeline over streaming
> > replication"
> > 2. Refactoring related changes.
> >
> > It can make my testing and review of the new feature patch a little easier.
>
> Yep, I'll go ahead and split the patch. Thanks!

Ok, here you go. xlog-c-split-1.patch contains the refactoring of
existing code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.

- Heikki

Attachments:

xlog-c-split-1.patch.gz (application/x-gzip)
streaming-tli-switch-2.patch.gz (application/x-gzip)
�����y�>�D�C��u8����$���DSf�[���#7���n�}E3#<�����&F6{4W�)���
�r���cV�V_���h�v>��J5|<k������'g����-]hgss��n�����5>#xI������h�I�V�)������@X��0|? ��o9FX=��k^1{�����:%e��K8��t��< ��z-�� �>e�Z�JL�/��$�fLIk�H;V��7t�PLW��/����89K�T��=�:~w����;����6�q���C1���������c6���mg��!�mF�e�0p��X�"�b]���"�>�M���W�c��;�`#�`���O�8@����;o���(���jW1I+��9�������qw1�P�l������?�}�w:1E��`�qy��Ytf��	�,R�)8�z/� ��5��9��������!�+5E(YMhI 2l'cLss�R�����L��;�~v3%u���#{s��6!K��d:u��6�-��{9�MW�9o6gt��e���#�2�Fx��.D�B������w�����A7[WN*�@���T ��	 �V�bMY��
���y��^���������l4���I���f$�:��VD�h��=MT=F��:���I^��"������:���tsN�%���Y�S����A�Q]'/*�1H����9{n5��Y�_v�=�x?���}fr�t8��#|#N5������B�H\�9u�fx�)F���q)�L��r�8���h�m{n�r���_2��3���P����E����q8�U�P�/q�
Y�fz4��W�Q)m���~�x �,@�Cx���nx���|�SN�hn�T'�>��".�g�*:&M�������n�=�l�/2�;�7����0�&����=�8��
���vL�UQ@�����e��)sj[�����Z�����a���e�Yr�A�B�����;�h&>�����z
{}��`��-b���&����={A����0�!�N 
hk�$��v@2r���Y0�1���(fAg@�5�57�B��9Ws�B���.W�{	`�&��������[�[	��2��4�uZ��_���Pd<�m.X���n#�Z4�5���L+��P�'���;-Q^k�z!�2d+3{������u�N<��F���c��GG|��)mc���egZf`����LA����m	�d!8��k{�|	��U2@���|gzQw5@�L�3��{%	X�����!��B��2l2dY�>)~�SRQ�IcZ9�� C���l��=�hI����]�����l��Z&4�r&u�
h��f�F��l]��4��V������e*�5m++�\-�&v�md~��C�c�,&���:��bRym�X�I�����e���v�|s����Qlm[���Lbd	[�������cI���e�a/�~�%T���J�����_�w�et �/<����_�6��4F�����b\�F1������Bd�\��oS���A&0k��M�g�0{�����F��H,vW�|�������j���bx���q�}�
�X���O�yO���#T9�5W����A��Zh�����,_�f����RW�ClS�k����|�UJ��M
���EJ"d?%�~8B�2x� 3P.����H�4��,x��e��r$4N����Le�L�����u
L����D�K���_��oO���3+�Y��IIF����M��gJ�
H>	�0�S�{����E��5�F�X�E��
E-]f$J��2@&
���3��p	dc�Ot�J�UaXh��5Qs��\c�� ��q���^9�'�7 0���G']�>��Y��b=��E�,��o��b����2���^1fQ�5z���m��������n`X�����3�4_���{���`�X�u
WIs�ma��j�������jmS��%$��2T�)H�H�mE?�����QW.[YA���b��L3�$C�J�3� �:]P��n�CS����e2`��7�1\�)- F4AB�U�)Fa
	L�)/���;��&p70�;����L��a6g^�?U�4��h�W�	y�AP_��@x}_c�2@,a�,���u�a��,L�I/������'��o�K����uLy�<i ���Mr���-��,wb��6���l���f��V �,��>a.���p����O�A+e(0��{��b3�
�1��'�z��C�j��u������u��� �����w�v���A���{�s/�8ij-�8)Le�~�����'�I�'�B��F@x���r��m3E�9()/��K�V��:At+��:��I6L��NZ���P�^�b�C0#���5��9�d�����Y������n7��I�a^I\��z�G�/l��L��o��=?:��d��^�o_}8w�hI6��g Yi9���.;P�3+���b\�'ty�FV{+'�)z�J&��m�V�z(TS`g]���YQ����4�
h�[�C��(�0w�f�= \<�dT�����3s�P�.	J���lFn��[��,���6?�0$��E1��XJ���)!�=����4����$�h0�����������v.�v�Y���� �RKB�u��#7
8}��h�-��e�	��cs���c��������O�_�[�]��
-n��.AI/�/
�����Ah� �Z�6�]?	V*6���v���8o87���fl��h���S��ye���A�k
�����(��<�\q3%�N�)��?+�����Z��_\�3��n;��)��i�QtLyid�P���c`^����L<f$��"��,�������v���]��m^��t��"������CDD�@������Y|P�>�0���f7�b$�2��P���J~����i��z�PQ��'�0[�^93�Mv,�4�~#C
;��DQO	>��P�b���'�l)�{`q
l�*��-.6��#4��$SD�*vKA��`������pi����/g��at�V��m�=+���>��O��?�'(�F�Aj���6���~'���������f�.�o��
z�`�n�V��6F������@���t!�0i�v`r����~~5�����i%g�l1a�Wd��`���V�����M���L�6m���~��0�%�^�������E����L���S�%���sI|l�6\�����m�_�	����'����Y�fz������;�E�^�������������V!��@[lS$*%Q�t��5IB�������q���G/���g,�)��V��Ay��
](�%1�:
D����
�~4�mt�j#��c������C84���o�	��'���f�qc�����s�]	�+���l9�7W��5��;!��������:Yc*�$���7�87�I*$��E�����3k/r����7�/V�q�Hb�_��ZU����'SM�,����T�p=^�'P�V��"���3��N�n���8i�A�p����Jl�jL�Q�3�v
n�3:u=����������=*��t>Yy��+t�����o.�r���r����f
z.{,�\����VR���I������-'�����������{famy)�h����y
I6����A��@��\����Hy6���]����C!H�t���T^;�z�VyN�$��L���$��c{8/��W���\iS�_�]�r���z�S#���e���l8x�\A�>V"H�Z�z���8q�7�rUA06��0�~��a/�����
4G���3�G[v%��p���ESlr����A1�x�T	�+�{��!�;�?���Nz�Y���5`��Gf���4���n�+�m���U����_Hd����,�o�3��L�K�A�k����V�x�.(P��5E)L��3DK���M�TY���l���E���/�a��������j'@6U���������eb0�� xy�����R�:B����G�=2D>��������g)�ID;��'l�.A�O��E2��Q-��,����,����V��������6�R�7���B���L�����vvz~�L���6W#���b�	���K��n����&����.l���W'�����o�W�w%��(��W���������m�o����������Rd��z�Cqr+���A���������p��SV;A���[Y4ND�x�R?�X���S��)���z����Z���[V^;/�[������''�dt��1��:r���x��������:�����W�JV�����}C�W��%���d:��,n��i!���8R��pYiMB?E@�,M+3�#�������&����&H���`<�7+��K��9�%#��j���YV�
����z���2����K��x�d0��7��)�4@�)�X$�����>�56pk�f�D&'W��!��.�
�8Z�e�xP�����T�s���X�3(������v�%IaZ�'���[�,Rqd]<k.4s9��3P���F3�8mn3��xq�'0���U���4��(U�Ra��)�,�L��Ym?�fY��I��T���z��b��:�D�W�$T1[�u�T��������O3��/c��o�;m�.�~S	
���_���=#��g����S���tE�2��o�Krh6re�5�
^i48���D#Y�����n�=�G����"�2Q_9����'����h�7:w�O�K ��M2�o��
2F?��4�����<9����`��l���,)tSh=��������TQ��S�a��h��p��,������B���R�m5�*����]oT
��H�����x?�l������1b�?!�56����=;
BL�~@�la�:�"v�P����o���������C�fq����	(��~�G�1���]��
�=��)�~
18e���){�a~r�^��v����3�����wh�eG8����6���*Ht��@�n`�S��dA���\�h/�q�'V0�	�9b���9]���'��h���6�����5�
�Z`�n$v��j�����fBm�����f-�Q�v�Lc6��`���%�����dT��������
�@�L	��`�$e �hS�L����<��c�������(U�f��e��c(���a���9�N<r!��{�CtSx�:��oC�!�y7s�<��1��6�E�,Qz��|IF@W�Z/�_vJ�0���i�!I�Y���R�����H�Oy��VG�[�A�n}�&�s�v��D�Lm��U��g�T��*�.���%�U�6��d�����z��Z�{��z�_���`�h%�0uMy{�0m���]��������|����>z�U�t������`�08�8������AYdl�
Y����
�u�f�V��}���r`���7����X��J�C�n���tk�����~c�3x^;��tIt��E�8���1C�*��t�q����F��t	���!�Q!���b8�( OG��H��8Rq /@��������;����]@+��~N����)\�����
�s?���O��/����y���,��/�SznS_W'���+���^�5����WG��6��~�TOO_���N���n��l�������x�wIAX5xv���L�}yu����9R#��}~���T?���{���#��`�^XA����N����P���>a���<����,z��A6%��v4����-���0]����D*�wL�aP�;�=�z?(��A7Fc��#�:��]1s�9�g�'�$5x���1n��P��`����m�����O��WR���^P��3����2�7���s��2v��������@"���M_*z�xd�B�,j4Y�����bFfP�h��2%����qz���y��-'���A�/��R�_W�|XN��9���/�.��5P�B.����~@NIxq|��������?�X��
)��3��6*f�%�_���p��_@���\��g��nj�����!-���bkdMN"k�[��[-������um�����M������im��j���������Z��h�:��{���5XW���,�4o������L�������� ���`?�����_���<B����`V,���7��0�O���Npm&r~3\�9� ����E�X�D���y�}����_�f��H�F���k�/�\d��Q�2����cY���}�,Q�3������h.���h~=�#���{��������0;��������xr~�G�������a�1%����r<��
oy<e�����-��K���p~y�E��5� �9���S��"��d���������~c��Q�]xRyB���9�0BT\yoW�x���������.��?�����}-;B�X�[#FAy3-����Z5�����D� ��(���O�&_��|��pW����T��hI��>���QdEL��l��m�jF`��3\�Z�>��(C��[H�P��JD

E "����3��4+-(BWMG�s�A��a�&����`�!3��M��Jq3N�z��9Ek�����!qX=��D.��%��Q��P'��%<�-�j���Tf�/3J�tN��;�������r�#�/l�'�.@���PM���S�|��Ws��G��4�����`�Z�Sj�*����{7�*b�����J	�SY�X ���%���9�8t:�s�S}S�
��|���o['W������v�����#���[o�TNF��&w�����5B�A���f��z�VR�����	�\c@��.\�n��Zp ��&��
U:��f-&�Rp����B������6�Z�TE�<��d+!&����������N}�:S��7,j���Af�$�x�~�w&��t]d����6�Fc���0E
�q���/��X�.��+����E'���]��$E7`�����.��
���=����������*M0*�0����*�E���A�/�C3�x����k�1�W�z��2��"��g�?�����bp�~���`��{��2���o�7$�Q������D�L(f�W�Md�P�6J���
�+gG�xu��$~]�����
�kj�)�^"&�����5���A�~4I��D��e���E��������BN����7^��������O�_�|�����h�D.~VY�u��%�H�k�G$�L�=�������,���A���fG��yk$P�����8���h����d0+/E��g����&��m��������l047���5��s�>�Ih�����l��8 ��������B���s$i\�������ERY��7�=k�P-�mx�d���_�����De(��!o��h�SgJ��O����\ES�C<�g��j�������|��d�3�������5�5���77������t$�����6��gr�����+�l|K`u���V��@u�M���\0�=%�b�?=�����p�r��U�,gM�p���9dN`W�(tx��������3oQ;�������Pl�
�� �v&c�_�`H������"5��R���'����5���g�
Lw��4��)?��/�	�E��(X}9����8_��&��8k������s������
���xFl� �8�=w'b'��C��r~�E����+��q�<�����p#���[��BHP"���`���O,�,Z�d���7>r�{	����-�|�]�|�y���:d�����9:�8b%p������F��^(�U.\��,Xu���U��Z�o�@���H6e'`�	`��B�`��YD4��`����KLT+�������{k�T�FD����t� �n���J6-.���=�c�	�>qu�La���6�w�~{yzy��V���Wu����6�"J}��lT'�z�t��Ce9X��&����U�-�,N�a������Y�3�����=�I���N��o��s*�|Cg������"7�?]Gs�K#��U��AV��/d��������3:�/;~�����^$85�:0�
���bo=XF<�Z�1�
6OG������K�nsa;��HD����3�DD��^PN����r���<����V7
�:��)�FUI�%%�xa��C�m��1��y4�K<X���	��
����^��6�m��8��1��Fn��T-����J�F��u���I�8���,:iE��������<��rn�r��7����)F���AK�������8���4���i��=J�D��w	L�k;��V<.?>V)~;�'`uh�JA��P=v��V*|6!+�c&��7�w(�`���b�����:�dDj��/+x�m���cE�}�����5i�����@�qi$9�� 	tY(�G�pV|�uc�n��(%j%;�+&�	�(�=M��{�+�8�dj��
��pM~s�"A��4��S�_��a}�B�����h�����b�`�C�.L���4�a��'�����,�.m���h����z)s$_>���}@i����b�J�Y4/m$
���;5q��T���7`�2;���h{�7
�����sQ���q��[*�Jl�<��b������mt);��!5�a��h]?��!k��c�����a�7��m�w.�����j�6���j]�|{,�(+�S�t���	j���v��J}��x~t	&����U����3��?�FoU;����T* ��!p��@�f3���:����.��~�	�_����a���R%���O���}�<�
�*@3��7�E���n~!/���j��k�g/-�\>��?�r�[i�)���w��&�03��	8���h��C$4��9����6:qWr�_���M��o�;�P3���Oc+��

i%K�VT
p�,������#tg���@{�T�r�&3����v���0:^M��?���c�;����������x�dVK�U�^|��"���C�j�S!Z�ROGH���F�5��6��Tp�c��4��'fn9��������]�
��E���W-�l��+N����!��
��"Gs��*�&�'����2iAU^���RP��x1�K�a�Fm�j$�w���?�Ih������p��^����M��UA��W !�
���me�������<`��&m�!d=4�����bD'�����dy$7:HF�cLYUB��tIhER���N?��8���=���6x��<X]�$ P���?~�@����3
/m��Y�?(a��N��v9���#8�p_��[D�_�8��p3#�cn�
�|�]���!�B@j���R1�
Q&<aH��!sI�"��f���D��P"[��.s�]�	���6�'����3$��\UcYm����+%
��&�I�b���p&(/6���,+	�h/�?�����'�a�|
!dt"Lib�S!�2�vTw�Y!����4�w��Q��o"�xVG��S"/)�����m����XT��^'�;a|�A'���7D���?��Q��r���
�a�|	��=�������e[�7�93��b�=B�+�|�6�D�nB���vy��x�a���a�,y�w({d}��9���$q�Es�3�����K��d3m�R&?���"?�
�VY)����2�;������Kn���n�����%	8���%��'�[�)����K6�j���^���JU�!�����k��_!���}��V����$M �!�d�pv�dp�0�H�Y(��h�����lH��`�����Z!?�h�RPX8��;(���p�[��������R��|��E�N�A��TA���V*��6��%P��R%B��QT��?L�S]�������5���'��"GW-$�d��$��5�H�0'�s'�I�l�:&��]]K��r+��L�����Ct�G��k�*���;����������f��������N�#���`�`][�
N��=����
���1������	��c����]��%�&��0K'���������=��K{o�	��u�U��L�I�����1���!]�4���->N$YX�?P��7a������l��)H��r�%�iJ-�
������� ��R������@mVV�$2mR�� �a�~��0�������A+f{*��+��Bv���N���=���7����k3�Q��AwN(�*hM,�$�(v���+	��!L�����y����$���0�UM���W!u#H�(4�j��2")���G��F;��>Ih�s���m��E��(�sY ����s� �R��&��!�,��Za��$�lZ ��lH@)������>F��;�iE��������]���!��?�`���\�%���� �"�������'�Ta3*�$;X���;Sc��4����$��
��3RrBw�F��q�k>����8�������������a��D���AFS��d�9
��X��x�a�_��:{]�q�%R��}�|la���Z~3$�~�Y��mF
�k�.B
F�J�A	�s���s��6#,�K��s������Lg��b���D�NM�8�[r�|-y�����R�z��	��v��)sh�"�O��I��yE3��>�A
�����uq|q������rSjj�-�Bm�[gE~;����N�&�^B������U��o�5�=�
�&�*����G�J9{����~��B��	W�{b�A�sJ���4i'p#���z�eH=�v�s IH����W��������U�'\�x�����&�p���P�h�+�Jy�Z8�
����H�o�|/��a9��4��zB���Y�����g5�azE��J.Q��jZ;y��Q(����@q'�"����A�`������{����
��������*���-h�$h������{f�<��W�cB��s�s����%d���&���i��<����"2������6��Y��2���(���B)>`�����_/<K�(�� ���Kb�H�v���fn��9�'�+��p�����������+��	|����`�Gk�kQ��>��o3�S����b�����=�o��O�^N�~������!��l|U� ����iI��S�y";[����Aq^w�����nc{o��wo����+* 2�6d�,��-u2 (�T�H�.��mW�(�=b�#cY�"�v�;�������&����%e�f��[#��0�U�c�j�z��|R�;�t	P�,�H����
j�A����D���`�����Fe���1�M���x���d(bf���fpH�l(9��	�6��2�	R� �*YY5����N���g�J�"�3��:A��j��jC�mR���Z�2#�
fh��u�pE����K��
N���!T��Z`.{������v>����L�Ij.�@%m��3N>�����*!�H�O&Q,z4o���+M9�����.��R1}���-(��/��}PBf�z�������]H���p��MZ�-k��q?�
JfK6T�,�m��J���It�w�l������@�#G��4���GE�^���"k����F�8���9��f�g�u${[;�Dp�Aq�����WA����������bRf�;�2���\��M��3��z
&g�$hW��0��9����p�;�[��'�a�7�M[
�P���������\�%�����Q%����*9L��j�)�'vib��C>�@)�
	�����C��ZF2��p�f��F�r�����n��*l�h����Y}I�������eFB&�)T�6�Ua���:���MF���	]Co!���8C��s&a���P��<7����T`�9��,�pO�EI<���
4m���W��_K��A�y �]R��t�v,c��.��W�XI�/I�e>KlQ^C��/
��/���L���'�'���\��u ��4!�m3��H'�A�K���}��]L�:������ ����,��
s�1+r��^��Z��/���L{�x4���V8F|}d)�-��K���"�t���!�p�%�O�!����}�I���u6��������p��A�?�v�L6nv������N��)�6����`�X�<`��%�!:M���Ts)��J���(s����{@�(�����Z`���.	,+�1L�k�������~�vP������vV��lI��E��0��]Z<�fa������:${�y<F�y��������0\��;E���q��7R 8=2����X������%���)�>~	��U���U}`s��pbSL2�*il��������k�mqw�W��]JU��Q�mQ�8���
F�`K`N��%i���5�]�EU��i8�a���	0@w�Q�sC`�K-Q�D�Z���D��`�!��|�f��9.�;����K��!$����//a����fA�_�m�����GTPP6Pb��.tS�~t�4�F�����
`w�}C��Gg���!X"j�G}�?*oa����N��e[*����f��+�����Y�E�hXQ����.���@Z��	M������ L�];�V���^�%H{0�,OC�R�V�+��g��(y�B�ek��6��8�3s�6�� �]��n��lm�`W�o4�p �u��;�~##��|{a�"zg��-�������z�������X�v]N��D�j�X`����=%a�5T!|Xc���L�+8����9�Q��O��(�Kt�0B37�;��x�O�|�����v�������9�������G>����SLp���6-�F:cG\.�x�
�0kn��A��C���`�|��������Qx�R�(ps<�����<yH|$��t�h3]�WI
Z�R��������s��|@���9�&�Q�ad�7�k:���\���#���C�p�����]�G����\�����c�vu������������&��"H-����G�
�����G� (ugs�fC"�.��a$D2��z�Y�;�hLub�F D��s;t9rd�cF&�6��c�H��foFHW�$
r������>d���a�&����u��"���w��������c����A�=�*r��K�-������	v����7���j�����|��pf����<���B\f�)41�)������KjLg"���)F�6iM�S(�kf�I���t�����6�`HZ*%1g���$d������i�����^����0�	*��!����j�zm���,N��ng~s;�OR����&Ep/p�l��]�= ��u�h��KB��Gt�cpV�B�t�kw�O���O7�(U�����-�o;��E����z�*�����-��L��p�Bh�5.>�0~p/������|�,�|�Q`��@�����e�1�SS��O��k�fM�h>C�s+�������LC5dN�Y���n���*Y��nH�Z�1�O��iT��!�3������b���Q�9��J���;
�kT������n�8�#��!�u��p&����'��<�)���W�\�y�������p�N��������8+7��>0���G|����u��OIBg�7�=�b({��z(���r�R-�J���X�Q������d��J�(B��n��H��+(�&��2�����
!����g�����l��P��$]r-�P�4���;�'�N=Lh�ju`r�)Adez��Yf���NZ�G��@��v�[���D_�[���k�	�
nT���^��R�WO�%�P���l�v��l�R:�}�3�v�@@F��3�JA������
k���V��3�����x*o����c
��3���c��<���������@H4��4;����p������-�H����y���C���G�Do�^��!@��9[C.�������K��iJ@��)/l"!�����cTx.(��\vLZ

n��&
�(�Rg8��$\�(n�P���}�	#{�gl�������n���bV`��%Gb�X~�3H����B�Q7�]�m��p2�1��>�?��[T��)6�@�`�����6��'=I5�e���.����*����Dp�Q���^�6_$���28;���oKQrfGl��XOf���G<�kY��/p���,��o��[
�9�,	�f<����"�,:�U�N�[���7W��Z���j�T�[�"����(������J��0|>O��;�n���p��HL	�?k���9�=�?G����>Nt1�}px|�`�|1���`=@�P�BEG
j��[J��S���c��W���K���73|���[�L p��cbDB����������R�De�5L'��
�J2W�`e<����������\�aR|��V�3>Ea
��s���a��HE����vsoh]���d�e16�'1D/��B�[V����tf������Q����/��`jc�{�7c�s4�<��>��?�2i������%�V��LZ��2i�{��F�� �#! 5��M��g��W��|$�#����-�Q�e%����������q�`��kh�`��W���xY���{�/�~/���(��T'�����N��x���Z��S��J����g���g�/��Cd�f��!s��=�v|l[�	p"lbCb[7(��Y%Z]�}q�l�g�+�f�E�s��l<��k���uq�Z5�������a*9�lH+���1qu?)���nc����H�������o������}����s6�:�W����d(������/�@Z�z��������Vxd}���*!�0��<��h_��8?�-�+�|�~uz~t������]��MP�����Nr������99�f�l����1��iYi�
;Z�?���|<q?��Vt�c5P�>\�Wg�?7��_�ed�1�,���<*`���SZ�t������O6�?3�52m0-6������1��F��J g3��{Ge��T_�!c@�������,?	~v]>���ace���`�������8j>�#�A��@r�mQ{�9�p�)�G����g�+��d&#�%���{3�UJ����x��U�~g0���I>��@��g�,v~p��\�;b(�}����T�76j�u��[���=���>�Rvl�:�eV��hX����Cv��9��e�9���a�J�"�����
��9��88�hy���f7��'���n#pM�c�)�sB0��Y��-69c���	:�!Kj�g�0l��6D P��y�bf�q���n!JS���zp�j�Y���Pj�������A;LK����t��|��^9n���h���L�����L����
&���K*C�0�
���w�`�?���4�6��	@����_
�����h�i�f&
X��gE�*%Di4������9�P��������<�ML�-;��a�L�����`� ������t�c*��������tp�UK�l��;3�s�6"�R�+p���ox?r��`��KP�QX=�*����i�b3Mf/���Q2.4�m�*�4��]�\c=��U�J6����4����@�%��#���dNBI��Y��s�K�`�$K��������D�K���l�(r�A�UOa,��M|z	Oiq4�s���V�U�n�Nc��*����W�$����-��ew����WEI�H�������J ���>�o��;���\�O���P-T"��!$b���\iU�!L7h�cL�NO�:"�j�C��n���a����U�E��=8.�������=4������:�'B>H��.�;���v���(k�d�`�I���$�
81��)n0��Hu��5Dw1��N�1�.���������Pg<��S@���\�M\�bG�w�3���'hf%�p�==���k�}��R�t��<���E��$�����8��.t(v�\��8f����<X`�>��7����LWM�#��@����!�q�������<�ny���(������Y�|��Ni�;F��,^E���0��9��tN��N�\��?>)Q�Y+:��@?O��MZ6��(�������C����VS�ct@����I�Y6�P���!w�L��Z�9��T
M�D��9�{��'6�'�so|�����|�Y��"�9l��^��iR�����|$�U�"���L��Yt�xx,�
������P�"���������G���L������B��v/R'�^e�;
�������)7����w��z������]��K�V��Y���'��c��<'�<��d�~+�����xz	�3�����hq1��*���u2����l�'�O*������>��AA��Mv�(��wl������n�`8]��,`��Ig��i���� �b����q��%�7��Y�=���?���x�[��V;��$8�;�����9����G�Yx���@��
?���������4�`�p6+�:�6nr�����PX���;V�K[�������g����S����;�������!p��>|0������c�<~�7P��M<�����@vo�����Y0F�J����v�Ft]LB������U&�?	���
�1(����HQ��6P<��}7)rc�kS���F�(�2I:��%;�]��K�Ga�*���6��{Y�y��a|�?/�[�N��v;�@h�����B������VA��������H .���&SL��S�RC�dO���3�!G�������fC�,F ��e�A��^|n�e%�k;�������v�sg0��������"P^Z���b��T����?A����vGa�A��:=h�(���o���!�U(��p=�k(������H��t�}#Q*K�Ld-�;�W�~�4\4�T��r���)�*J�����RF�������^�,����=�����B�]M��+%�%�����{��~���(+����7�d�;��R	���������\������������AO���n?��J�8$���I���r8�Iu�?i��	]�{�M�N�;������$��{^9)��7�0�`��3DD!2kU�?ViN\�����b�,/��������V��fD����s��%�uZ"�a�q��d���y[� ^U�6NQ����Ip�O��[��t���,�M�����~����L��SG�o7�l���"��T�6#�=E��N�?HX�5�`4T���0��-#'�i��BV+������?���Jw��Y�A�ry{BT�XKQ&}M>W7��f]jr��9P3Aj7�(A�n�G�6��U�(!�����Z�i�����sM�
�Np#��W��TP��9��jd.����U���+<��H�^B�1n�0HV�a�P=Pu��+*|�$$�E��m@H�Ij�w�|��l�H��A��M\h:_Y�������0k���_��|��������.�����t��G�!��V����w>����|�{�����Y�H"jwP���.5P��$m$j�`��2'	4��B�A3B+�o���f�Y��������#f$�=8������H�3���?�������Wg'T��vE5t���	o.�]}�/�L�c���������f>^��N�\��)��^�
-7���OE:��X��5�w��:�0QR��6$�6oJ��`���A&��pv�f��g���wK8�v
���3:e�����G�,�$���*�\�dA���xf�����I���e��cDfQ
���$h��.s��[�i�
(��9p��@��V�o���h!�J.9N�?��(�U��U2L�~=�����W�x�]���78VKVn�o�hb�v����bO�?�����[�����?�^��k�-����6D�UP�n�����U�R����-�������N��Kz�����F�����8?(���Y/�~�~���k5{V�'S����D�i"�~�;�.K^q
��T_�((���9i��,�8v�*n�0K���6�q��]���T��f�T�y�x<��
|E	z�>�2�_d�g��!|�7wYS�r�}���j�n���]k�l%�������h7���I�8@�������:]�N=��3���u�JLq�����n��hRd�vU
�4�L<S+1�d�XV9�������^���@����}����d�)���`�V�m��yg������pr�L� ��n�J�$�}��}Y�#B��Q.��&F��N����4����g��������������JS��FS�����[{��5�y`���{�'����{���N)�E�!D�;�sa{���3{��z����1����3@85��|��8���(�����]*b<.YG�4��{!����?b�=��A'k��)�������0��l���r��S���`�m�A56�s�6��95�
�\6�,�]���,�
'���/�+G�*��r3��%d��i�(	A��2��c�R���&�@lRr<e�o7.�/����l�D��RA�=�grt�5@hBl� 9���&;��s0������Q���]aO^2R��Q��ch��z[�[��Y\th
7����f0On@���V���A�T���5$������]�7�����=�8����I���Yi��]�m�p>�.��F��S�A�Zm�w�H��h�����d�K2�R_�!\�Is~��e~��[d�c��<%Cd��r����,:r����/*p0M��g,{-x=�eCW7A��n��4*�z�SL��P���i�Li��;�''k����u�E��UZ�L���,�Z��>�����l�-,�o�[O��{�iT����=�n������`UU2R�}�P��~��:�"r�$����2�&Ex�9�\r��{����;�X�1����Q2�37� ��4��f�w�ro�������gT7ScejR����ti�SJ-x�4H�:����|�#U����[�����_��>�Vm���:+~��c`�2m���v'�E���*��>H$
���l�`�mkK��0�O,�
r;�m�J����������>-�J7K��X�g:���?K!�s�����i��������j[�[��~���������_�`]	x������5j|/��	�x�[�.~h&��������c�_���C�����>�w���'�D��&x2�Q�zpf�G�5Et��2���p������I'�S�y�B��4��i�S"	e���~�	8���y��R��1�SYq?d�k�8���P�p:�`R����0�\3���"�2��[R��Q��_Q���-y�Y\��1��Ru1Z:+���7�L-/��<7M��(���`���+��jf(���������L7�j,hY��Xd6���`#��j��=g� �Ak-P������V��#7�T#D�5�GM���eq�z�T���]�6�\_� ���G#�0
�������	���~��;��vv���|���NC�=�OoR��M���qo��[�C�[m��x���c�%�e8�W��%]����������Uput���}p�zwr4[��Y5���]6��=Ojck�������{����{�&U��7��N.�{��zn�T��=J���E�f�E1b��p���68�H��c	�f5��Z���f�b_(�����
/����["I�alxq�7����#(���M�/^�4W��������[�7�n�z��0jk��s}���P�-�7V~Yfcv���;9lL~ZapT��O��?������.���W�Xb��%[��NS�x4+����B�;����`�O������M�Q�K*[+}�vb���?b��q��E���~�a����`��VE?o�FPG@e_H��%sW2?�U������y��0��a�pT�5��N�N������
��33�TY���u1�����U7���6��G�M�D�P�z�3V���"�"����vG5����Er�-	��Y#I
"�,q`���,�����H��p9%�N��x�������d�]e�����8��w��&� �dt���[p��\G� �#J}��FL�)�a��C�����*��k��,`D�p�"����^�Y�"�{J.��Yy������8��d��c���`�j=��(�D��2���9Ig��T�F����>�moc`H�5d��kM��o_�������jU���0��<N�����h�5�m@��"�H*�y*��P��0��S�-�1���]�a?��6����g���UX�=�!<����w�
��}�	@OPF��}vEo0�r��X�S�r7e~w3�~�d�CD����)w�|9U+���UX��6��{��.�L��������e����U�$m��2��k�H�������,����
anG���)��_�2��Y��(T�fw��>�R��D!l��rf��T��gDN�(�O��$6�*sE�U�D�����P]|���nm�s�X��h2.^��#��e[���lisCP[!K�2M�;�ix�w��~h�2�����X����7��~��}��	w����&���[��}l�0�Da����|���q�x����_�p:��8����2`J������B9����NJM�����?G�����~�>=�pnk�
?E����.Z��q��\�r�e�
IS�1���a?/�yJ\DLB�Xqc����'D+J�r��6?�>r��t��W4����Rgo>[�$������s'�E��dNd�$GSXH��](,u�d[�F?������,9����-UJR��n�\w{�Z����om�w�t;�O�.I���m4��OrM��a����n%�H���	�����d#�
�))Q\���3�
o6����[(����S�����kE�|���"k���2�vpm��j��v��3n<���,�M�f�����(��
#8Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#7)
Re: Switching timeline over streaming replication

On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
On 25.09.2012 10:08, Heikki Linnakangas wrote:

On 24.09.2012 16:33, Amit Kapila wrote:

In any case, it will be better if you can split it into multiple patches:

1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.

It can make my testing and review for new feature patch little easier.

Yep, I'll go ahead and split the patch. Thanks!

Ok, here you go. xlog-c-split-1.patch contains the refactoring of
existing code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.

Thanks, it will make my review easier than previous.

With Regards,
Amit Kapila.

#9md@rpzdesign.com
md@rpzdesign.com
In reply to: Amit Kapila (#3)
Re: Switching timeline over streaming replication

Amit:

At some point, every master - slave replicator gets to the point where
they need
to start thinking about master-master replication.

Instead of getting stuck in the weeds to finally realize that
master-master is the ONLY way
to go, many developers do not start out planning for master - master,
but they should, out of habit.

You can save yourself a lot of grief just by starting with master-master
architecture.

But you don't have to USE it, you can just not send WRITE traffic to the
servers that you do
not want to WRITE to, but all of them should be WRITE servers. That way,
the only timeline
you ever need is your decision to send WRITE traffic request to them,
but there is nothing
that prevents you from running MASTER - MASTER all the time and skip the
whole slave thing
entirely.

At this point, I think synchronous replication is only for immediate
local replication needs
and async for all the master - master stuff.

cheers,

marco

On 9/24/2012 9:44 PM, Amit Kapila wrote:

On Monday, September 24, 2012 9:08 PM md@rpzdesign.com wrote:
What a disaster waiting to happen. Maybe the only replication should be
master-master replication
so there is no need to sequence timelines or anything, all servers are
ready masters, no backups or failovers.
If you really do not want a master serving, then it should only be
handled in the routing
of traffic to that server and not the replication logic itself. The
only thing that ever came about
from failovers was the failure to turn over. The above is opinion
only.

This feature is for users who want to use master-standby configurations.

What do you mean by :
"then it should only be handled in the routing of traffic to that server
and not the replication logic itself."

Do you have any idea other than proposed implementation or do you see any
problem in currently proposed solution?

On 9/24/2012 7:33 AM, Amit Kapila wrote:

On Tuesday, September 11, 2012 10:53 PM Heikki Linnakangas wrote:

I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby
server that you'd like to keep following the new master server, you
need a WAL archive in addition to streaming replication to make it
cross the timeline change. Streaming replication will just error out.
Having a WAL archive is usually a good idea in complex replication
scenarios anyway, but it would be good to not require it.

Confirm my understanding of this feature:

This feature is for case when standby-1 who is going to be promoted to
master has archive mode 'on'.
As in that case only its timeline will change.

If above is right, then there can be other similar scenarios where it can
be used:

Scenario-1 (1 Master, 1 Stand-by)
1. Master (archive_mode=on) goes down.
2. Master again comes up
3. Stand-by tries to follow it

Now in above scenario also due to timeline mismatch it gives error, but your
patch should fix it.

Some parts of this patch are just refactoring that probably make sense
regardless of the new functionality. For example, I split off the
timeline history file related functions to a new file, timeline.c.
That's not very much code, but it's fairly isolated, and xlog.c is
massive, so I feel that anything that we can move off from xlog.c is a
good thing. I also moved off the two functions RestoreArchivedFile()
and ExecuteRecoveryCommand(), to a separate file. Those are also not
much code, but are fairly isolated. If no-one objects to those changes,
and the general direction this work is going to, I'm going to split off
those refactorings to separate patches and commit them separately.

I also made the timeline history file a bit more detailed: instead of
recording just the WAL segment where the timeline was changed, it now
records the exact XLogRecPtr. That was required for the walsender to
know the switchpoint, without having to parse the XLOG records (it
reads and parses the history file, instead).
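
[Illustration, not part of the patch] A minimal Python sketch of reading
such a history file. It assumes each non-comment line holds three
whitespace-separated fields: the parent timeline ID, the switchpoint
written as two hex numbers joined by "/", and a free-text reason. That
layout and the name parse_history are assumptions made for this example;
the WIP patch may use a different format.

```python
def parse_history(text):
    """Return a list of (parent_tli, switch_point, reason) tuples.

    Assumed line layout (hypothetical for this sketch):
        <parent timeline ID> <hi>/<lo in hex> <free-text reason>
    Blank lines and lines starting with '#' are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split off the first two fields; the rest of the line is the reason.
        fields = line.split(None, 2)
        parent_tli = int(fields[0])
        hi, lo = fields[1].split("/")
        # Combine the two 32-bit halves into one 64-bit byte position.
        switch_point = (int(hi, 16) << 32) | int(lo, 16)
        reason = fields[2] if len(fields) > 2 else ""
        entries.append((parent_tli, switch_point, reason))
    return entries
```

With the exact position available, a sender can stop streaming the old
timeline at the switchpoint instead of at a segment boundary.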

IMO separating timeline history file related functions to a new file is
good.
However I am not sure about splitting for RestoreArchivedFile() and
ExecuteRecoveryCommand() into separate file.
How about splitting for all Archive related functions:
static void XLogArchiveNotify(const char *xlog);
static void XLogArchiveNotifySeg(XLogSegNo segno);
static bool XLogArchiveCheckDone(const char *xlog);
static bool XLogArchiveIsBusy(const char *xlog);
static void XLogArchiveCleanup(const char *xlog);
..
..

In any case, it will be better if you can split it into multiple patches:

1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.

It can make my testing and review for new feature patch little easier.

With Regards,
Amit Kapila.

#10Daniel Farina
daniel@heroku.com
In reply to: md@rpzdesign.com (#9)
Re: Switching timeline over streaming replication

On Tue, Sep 25, 2012 at 11:01 AM, md@rpzdesign.com <md@rpzdesign.com> wrote:

Amit:

At some point, every master - slave replicator gets to the point where they
need
to start thinking about master-master replication.

Even in a master-master system, the ability to cleanly swap leaders
managing a member of the master-master cluster is very useful. This
patch can make writing HA software for Postgres a lot less ridiculous.

Instead of getting stuck in the weeds to finally realize that master-master
is the ONLY way
to go, many developers do not start out planning for master - master, but
they should, out of habit.

You can save yourself a lot of grief just by starting with master-master
architecture.

I've seen more projects get stuck spinning their wheels on the one
Master-Master system to rule them all than succeed and move on. It
doesn't help that master-master does not have a single definition, and
different properties are possible with different logical models, too,
so that pervades its way up to the language layer.

As-is, managing single-master HA Postgres is a huge pain without this
patch. If there is work to be done on master-master, the logical
replication and event trigger work are probably more relevant, and I
know the authors of those projects are keen to make it more feasible
to experiment.

--
fdr

#11John R Pierce
pierce@hogranch.com
In reply to: md@rpzdesign.com (#9)
Re: Switching timeline over streaming replication

On 09/25/12 11:01 AM, md@rpzdesign.com wrote:

At some point, every master - slave replicator gets to the point where
they need
to start thinking about master-master replication.

master-master and transactional integrity are mutually exclusive, except
perhaps in special cases like Oracle RAC, where the masters share a
coherent cache and implement global locks.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#12md@rpzdesign.com
md@rpzdesign.com
In reply to: John R Pierce (#11)
Re: Switching timeline over streaming replication

John:

Who has the money for oracle RAC or funding arrogant bastard Oracle CEO
Ellison to purchase another island?

Postgres needs CHEAP, easy to setup, self healing,
master-master-master-master and it needs it yesterday.

I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable

That is why I suggest something to replace OIDs with ROIDs for
replication ID. (CREATE TABLE with ROIDS)
I implement ROIDs as a uniform design pattern for the table structures.

Synchronous replication maybe between 2 local machines if absolutely no
local
hardware failure is acceptable, but cheap, scaleable synchronous,
TRANSACTIONAL, master-master-master-master is a real tough slog.

I could implement global locks in the external replication layer if I
choose, but there are much easier ways in routing
requests thru the load balancer and request sharding than trying to
manage global locks across the WAN.

Good luck with your HA patch for Postgres.

Thanks for all of the responses!

You guys are 15 times more active than the MySQL developer group, likely
because
they do not have a single db engine that meets all the requirements like PG.

marco

On 9/25/2012 5:10 PM, John R Pierce wrote:

On 09/25/12 11:01 AM, md@rpzdesign.com wrote:

At some point, every master - slave replicator gets to the point
where they need
to start thinking about master-master replication.

master-master and transactional integrity are mutually exclusive,
except perhaps in special cases like Oracle RAC, where the masters
share a coherent cache and implement global locks.

#13Josh Berkus
josh@agliodbs.com
In reply to: md@rpzdesign.com (#12)
Re: Switching timeline over streaming replication

I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable

Sounds cool. Do you have a fork available on Github? I'll try it out.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#14Josh Berkus
josh@agliodbs.com
In reply to: Amit Kapila (#5)
Re: Switching timeline over streaming replication

Yes that is correct. I thought timeline change happens only when somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original
Timeline concept was for PITR and I am not able to trace from code the
reason
why on promotion it is required?

The idea behind the timeline switch is to prevent a server from
subscribing to a master which is actually behind it. For example,
consider this sequence:

1. M1->async->S1
2. M1 is at xid 2001 and fails.
3. S1 did not receive transaction 2001 and is at xid 2000
4. S1 is promoted.
5. S1 processed a new, different transaction 2001
6. M1 is repaired and brought back up
7. M1 is subscribed to S1
8. M1 is now corrupt.

That's why we need the timeline switch.
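
[Illustration] The sequence above can be sketched as a toy Python model.
This is not PostgreSQL code: Server, promote, naive_in_sync and
can_follow are hypothetical names, and real servers compare WAL positions
and timeline history files rather than xids. The check only handles a
single timeline hop, which is enough for this scenario.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server: a timeline ID plus a log of (timeline, xid, payload)."""
    name: str
    timeline: int = 1
    log: list = field(default_factory=list)
    # Maps a timeline ID to the last xid it shares with its parent timeline.
    fork_points: dict = field(default_factory=dict)

    def last_xid(self):
        return self.log[-1][1] if self.log else 0

    def commit(self, payload):
        self.log.append((self.timeline, self.last_xid() + 1, payload))

    def promote(self):
        # Promotion starts a new timeline and records where it forked.
        self.fork_points[self.timeline + 1] = self.last_xid()
        self.timeline += 1


def naive_in_sync(a, b):
    """Position-only comparison, i.e. what you get without timelines."""
    return a.last_xid() == b.last_xid()


def can_follow(follower, leader):
    """Timeline-aware check: refuse a follower that is past the fork point."""
    if follower.timeline == leader.timeline:
        return follower.last_xid() <= leader.last_xid()
    fork = leader.fork_points.get(leader.timeline)
    return fork is not None and follower.last_xid() <= fork


def run_scenario():
    """Replay steps 1-5: M1 fails at xid 2001, S1 is promoted at xid 2000."""
    m1 = Server("M1", log=[(1, 2000, "shared history")])
    s1 = Server("S1", log=list(m1.log))
    m1.commit("M1's transaction 2001")            # never reached S1
    s1.promote()                                  # S1 moves to timeline 2
    s1.commit("S1's different transaction 2001")
    return m1, s1
```

After run_scenario(), naive_in_sync(m1, s1) is true even though the two
servers hold different contents for xid 2001, while can_follow(m1, s1) is
false because M1 has gone past the point where timeline 2 forked. A
standby that stopped at xid 2000 would still be allowed to follow S1.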

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#15md@rpzdesign.com
md@rpzdesign.com
In reply to: Josh Berkus (#13)
Re: Switching timeline over streaming replication

Josh:

The good part is you are the first person to ask for a copy
and I will send you the hook code that I have and you can be a good sport
and put it on GitHub, that is great, you can give us both credit for a
joint effort, I do the code,
you put it on GitHub.

The not so good part is that the community has a bunch of other trigger work
and other stuff going on, so there was not much interest in non-WAL
replication hook code.

I do not have time to debate implementation nor wait for release of 9.3
with my needs not met, so I will just keep patching the hook code into
whatever release
code base comes along.

The bad news is that I have not implemented the logic of the external
replication daemon.

The other good and bad news is that you are free to receive the messages
from the hook code
thru the unix socket and implement replication any way you want and the
bad news is that you are free
to IMPLEMENT replication any way you want.

I am going to implement master-master-master-master SELF HEALING
replication, but that is just my preference.
Should take about a week to get it operational and another week to see
how it works in my geographically dispersed
servers in the cloud.

Send me a note if it is ok to send you a zip file with the source code
files that I touched in the 9.2 code base so you
can shove it up on GitHub.

Cheers,

marco

On 9/26/2012 6:48 PM, Josh Berkus wrote:

I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable

Sounds cool. Do you have a fork available on Github? I'll try it out.

#16Amit Kapila
amit.kapila@huawei.com
In reply to: Josh Berkus (#14)
Re: Switching timeline over streaming replication

On Thursday, September 27, 2012 6:30 AM Josh Berkus wrote:

Yes that is correct. I thought timeline change happens only when somebody
does PITR.
Can you please tell me why we change timeline after promotion, because the
original Timeline concept was for PITR and I am not able to trace from code
the reason why on promotion it is required?

The idea behind the timeline switch is to prevent a server from
subscribing to a master which is actually behind it. For example,
consider this sequence:

1. M1->async->S1
2. M1 is at xid 2001 and fails.
3. S1 did not receive transaction 2001 and is at xid 2000
4. S1 is promoted.
5. S1 processed a new, different transaction 2001
6. M1 is repaired and brought back up
7. M1 is subscribed to S1
8. M1 is now corrupt.

That's why we need the timeline switch.

Thanks.
I understood this point, but currently this use case is not documented in the documentation of Timelines (Section 24.3.5).

With Regards,
Amit Kapila.

#17Hannu Krosing
hannu@2ndQuadrant.com
In reply to: md@rpzdesign.com (#12)
Re: Switching timeline over streaming replication

On 09/26/2012 01:02 AM, md@rpzdesign.com wrote:

John:

Who has the money for oracle RAC or funding arrogant bastard Oracle
CEO Ellison to purchase another island?

Postgres needs CHEAP, easy to setup, self healing,
master-master-master-master and it needs it yesterday.

I was able to patch the 9.2.0 code base in 1 day and change my entire
architecture strategy for replication
into self healing async master-master-master and the tiniest bit of
sharding code imaginable

Tell us about the compromises you had to make.

It is an established fact that you can either have it replicate fast and
loose or slow and correct.

In the fast and loose case you have to be ready to do a lot of
mopping-up in case of conflicts.

That is why I suggest something to replace OIDs with ROIDs for
replication ID. (CREATE TABLE with ROIDS)
I implement ROIDs as a uniform design pattern for the table structures.

Synchronous replication maybe between 2 local machines if absolutely
no local
hardware failure is acceptable, but cheap, scaleable synchronous,

Scaleable / synchronous is probably doable, if we are ready to take the
initial performance hit of lock propagation.

TRANSACTIONAL, master-master-master-master is a real tough slog.

I could implement global locks in the external replication layer if I
choose, but there are much easier ways in routing
requests thru the load balancer and request sharding than trying to
manage global locks across the WAN.

Good luck with your HA patch for Postgres.

Thanks for all of the responses!

You guys are 15 times more active than the MySQL developer group,
likely because
they do not have a single db engine that meets all the requirements
like PG.

marco

On 9/25/2012 5:10 PM, John R Pierce wrote:

On 09/25/12 11:01 AM, md@rpzdesign.com wrote:

At some point, every master - slave replicator gets to the point
where they need
to start thinking about master-master replication.

master-master and transactional integrity are mutually exclusive,
except perhaps in special cases like Oracle RAC, where the masters
share a coherent cache and implement global locks.

#18Josh Berkus
josh@agliodbs.com
In reply to: md@rpzdesign.com (#15)
Re: MD's replication WAS Switching timeline over streaming replication

On 9/26/12 6:17 PM, md@rpzdesign.com wrote:

Josh:

The good part is you are the first person to ask for a copy
and I will send you the hook code that I have and you can be a good sport
and put it on GitHub, that is great, you can give us both credit for a
joint effort, I do the code,
you put it on GitHub.

Well, I think it just makes sense for you to put it up somewhere public
so that folks can review it; if not Github, then somewhere else. If
it's useful and well-written, folks will be interested.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#19Euler Taveira
euler@timbira.com
In reply to: Amit Kapila (#16)
Re: Switching timeline over streaming replication

On 27-09-2012 01:30, Amit Kapila wrote:

I understood this point, but currently this use case is not documented in the documentation of Timelines (Section 24.3.5).

The timeline documentation was written during the PITR implementation, when
there was no SR yet. AFAICS it doesn't mention SR but is sufficiently generic
(it uses the term 'wal records' to explain the feature). Feel free to reword
those paragraphs to mention SR.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

#20Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#7)
Re: Switching timeline over streaming replication

On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
On 25.09.2012 10:08, Heikki Linnakangas wrote:

On 24.09.2012 16:33, Amit Kapila wrote:

In any case, it will be better if you can split it into multiple patches:

1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.

Ok, here you go. xlog-c-split-1.patch contains the refactoring of existing
code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.

Please find the initial review of the patch below. More review is still
pending, but I thought I should post what is done so far.

Basic stuff:
----------------------
- Patch applies OK
- Compiles cleanly with no warnings
- Regression tests pass.
- Documentation changes are mostly fine.
- Basic replication tests work.

Testing
---------
Start primary server
Start standby server
Start cascade standby server

Stopped the primary server

Promoted the standby server with ./pg_ctl -D data_repl promote

In postgresql.conf file
archive_mode = off

The following log messages were observed on the cascade standby server.

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: record with zero length at 0/17E3888
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1

In postgresql.conf file
archive_mode = on

The following log messages were observed on the cascade standby server.

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: record with zero length at 0/20144B8
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: re-handshaking at position 0/2000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
sh: /home/amit/installation/bin/data_sub/pg_xlog/archive_status/000000010000000000000002: No such file or directory
LOG: re-handshaking at position 0/2000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions

Verified that files are present in respective directories.

Code Review
----------------
1. In function readTimeLineHistory(),
   two mechanisms are used to fetch the timeline from the history file:
+                sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi, &switchpoint_lo);
+
+                /* expect a numeric timeline ID as first field of line */
+                tli = (TimeLineID) strtoul(ptr, &endptr, 0);
   If we use the new mechanism, it will not be able to detect errors the way
   the current code does.
2. In function readTimeLineHistory(),
+        fd = AllocateFile(path, "r");
+        if (fd == NULL)
+        {
+                if (errno != ENOENT)
+                        ereport(FATAL,
+                                        (errcode_for_file_access(),
+                                         errmsg("could not open file \"%s\": %m", path)));
+                /* Not there, so assume no parents */
+                return list_make1_int((int) targetTLI);
+        }
   return list_make1_int((int) targetTLI); is still used here.

3. Function timelineOfPointInHistory() should return the timeline of the
recptr passed to it.
a. Is it okay to decide which timeline an xlog record pointer belongs to
based on the pointer alone, given that different timelines can have the
same xlog record pointer?
b. From the logic, it seems that it will return the timeline previous to
the timeline of the recptr passed. For example, if timeline 3's switchpoint
is equal to the recptr passed, it will return timeline 2.

4. In writeTimeLineHistory function variable endTLI is never used.

5. In header of function writeTimeLineHistory(), can give explanation about
XLogRecPtr switchpoint

6. @@ -6869,11 +5947,35 @@ StartupXLOG(void)
          */
         if (InArchiveRecovery)
         {
+                char        reason[200];
+
+                /*
+                 * Write comment to history file to explain why and where timeline
+                 * changed. Comment varies according to the recovery target used.
+                 */
+                if (recoveryTarget == RECOVERY_TARGET_XID)
+                        snprintf(reason, sizeof(reason),
+                                         "%s transaction %u",
+                                         recoveryStopAfter ? "after" : "before",
+                                         recoveryStopXid);

The comment above this code says it explains why and where the timeline
changed. However, the reason field only specifies the "where" part.

7. + * Returns the redo pointer of the "previous" checkpoint. 
+GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli) 
+{ 
+        if (InRedo) 
+        { 
+                LWLockAcquire(ControlFileLock, LW_SHARED); 
+                *oldrecptr = ControlFile->checkPointCopy.redo; 
+                *oldtli = ControlFile->checkPointCopy.ThisTimeLineID; 
+                LWLockRelease(ControlFileLock); 
+        } 

a. In this function, is it required to take ControlFileLock? Earlier there
was no lock protecting this read when it was called from
RestoreArchivedFile, and I think at this point no one else can modify these
values. However, it may be good for code consistency: wherever the
control file values are read, read them under the lock.

b. As per your comment it should return the "previous" checkpoint; however,
the function returns the values of the latest checkpoint.

8. In function writeTimeLineHistoryFile(), would it not be better to write
directly rather than do pg_fsync() later, as it is just a one-time write?

9. +XLogRecPtr
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli)
..
..
+                         * starting point. This is because the client can legimately

The spelling of "legitimately" needs to be corrected.
10.+XLogRecPtr 
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli) 
.. 
.. 
+ if (tli < ThisTimeLineID)
+        {
+                if (!nexttle)
+                        elog(ERROR, "could not find history entry for child of timeline %u", tli); /* shouldn't happen */
+        }

I don't understand the meaning of the above check. I think this situation
can occur when this function is called from StartReplication, because the
tli sent by the standby to the new master will always be less than
ThisTimeLineID, and it can be first in the list.

Documentation
---------------
1. In explanation of TIMELINE_HISTORY:
Filename of the timeline history file. This is always of the form
[insert correct example here].
An example should be given here.
2. In the protocol.sgml change, I feel it would be better to explain when
the CopyDone message will be initiated.

With Regards,
Amit Kapila.

#21Amit kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#20)
Re: Switching timeline over streaming replication

On Friday, September 28, 2012 6:38 PM Amit Kapila wrote:
On Tuesday, September 25, 2012 6:29 PM Heikki Linnakangas wrote:
On 25.09.2012 10:08, Heikki Linnakangas wrote:

On 24.09.2012 16:33, Amit Kapila wrote:

In any case, it will be better if you can split it into multiple patches:

1. Having new functionality of "Switching timeline over streaming
replication"
2. Refactoring related changes.

Ok, here you go. xlog-c-split-1.patch contains the refactoring of existing
code, with no user-visible changes.
streaming-tli-switch-2.patch applies over xlog-c-split-1.patch, and
contains the new functionality.

Please find the initial review of the patch below. More review is still
pending, but I thought I should post what is done so far.

Some more review:

11. In function readTimeLineHistory()
ereport(DEBUG3,
(errmsg_internal("history of timeline %u is %s",
targetTLI, nodeToString(result))));

Doesn't nodeToString(result) need to be changed, as the list now contains TimeLineHistoryEntry structures?

12. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
+ * The current timeline must be found in the history file, and the
+ * next timeline must've forked off from it *after* the current
+ * recovery location.
  */
- if (!list_member_int(newExpectedTLIs,
- (int) recoveryTargetTLI))
- ereport(LOG,
- (errmsg("new timeline %u is not a child of database system timeline %u",
- newtarget,
- ThisTimeLineID)));

Is there any logic in the current patch which ensures that the above check is no longer required?

13. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
+ found = false;
+ foreach (cell, newExpectedTLIs)
..
..
- list_free(expectedTLIs);
+ list_free_deep(expectedTLIs);
    What is the use of the found variable? Also, freeing expectedTLIs inside the loop might cause a problem.

14. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
newExpectedTLIs = readTimeLineHistory(newtarget);
Shouldn't this variable be declared as newExpectedTLEs as the list returned by readTimeLineHistory contains target list entry

15. StartupXLOG
/* Now we can determine the list of expected TLIs */
expectedTLIs = readTimeLineHistory(recoveryTargetTLI);

Should expectedTLIs be changed to expectedTLEs as the list returned by readTimeLineHistory contains target list entry

16.@@ -5254,8 +5252,8 @@ StartupXLOG(void)
writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
- curFileTLI, endLogSegNo, reason);
+ curFileTLI, EndRecPtr, reason);
If endLogSegNo is not used here, it needs to be removed from the function declaration as well.

17.@@ -5254,8 +5252,8 @@ StartupXLOG(void)
if (InArchiveRecovery)
      ..
      ..
+
+ /* The history file can be archived immediately. */
+ TLHistoryFileName(histfname, ThisTimeLineID);
+ XLogArchiveNotify(histfname);

Shouldn't this be done only when archive_mode is on? Right now InArchiveRecovery is true even when we do recovery for a standby.

18. +static bool
+WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt)
{
..
+ if (XLByteLT(RecPtr, receivedUpto))
+ havedata = true;
+ else
+ {
+ XLogRecPtr latestChunkStart;
+
+ receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
+ if (XLByteLT(RecPtr, receivedUpto) && receiveTLI == curFileTLI)
+ {
+ havedata = true;
+ if (!XLByteLT(RecPtr, latestChunkStart))
+ {
+ XLogReceiptTime = GetCurrentTimestamp();
+ SetCurrentChunkStartTime(XLogReceiptTime);
+ }
+ }
+ else
+ havedata = false;
+ }

In the above logic, there seems to be an inconsistency in setting havedata = true.
In the else branch, say receiveTLI == curFileTLI is false but XLByteLT(RecPtr, receivedUpto) is true.
Then in the next iteration of the for loop, the check if (XLByteLT(RecPtr, receivedUpto)) will be true and will set havedata = true,
which seems contradictory.

19.

+static bool
+WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt)
{
..
+ if (PrimaryConnInfo)
+ {
+ XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
+ TimeLineID tli = timelineOfPointInHistory(ptr, expectedTLIs);
+
+ if (tli < curFileTLI)

I think in some cases the check if (tli < curFileTLI) might not make sense, for example when curFileTLI = 0 for randAccess.

20. Function name WaitForWALToBecomeAvailable suggests that it waits for WAL, but it also returns true when trigger file is present,
which can be little misleading.

21. @@ -2411,27 +2411,6 @@ reaper(SIGNAL_ARGS)

a. Won't this affect stopping an online base backup? Earlier, on promotion,
I think this code stopped the walsenders, and stopping the base backup would also give an error in such cases.

22. @@ -63,10 +66,17 @@ void
 _PG_init(void)
 {
  /* Tell walreceiver how to reach us */
- if (walrcv_connect != NULL || walrcv_receive != NULL ||
- walrcv_send != NULL || walrcv_disconnect != NULL)
+ if (walrcv_connect != NULL || walrcv_identify_system ||
+ walrcv_readtimelinehistoryfile != NULL ||

check for walrcv_identify_system != NULL is missing.

23. write the function header for newly added functions (libpqrcv_startstreaming, libpqrcv_identify_system, ..)

24. In header of function libpqrcv_receive(), *type needs to be removed.
+ * If data was received, returns the length of the data. *type and *buffer

25. 
+timeline_history:
+ K_TIMELINE_HISTORY ICONST
+ {
+ TimeLineHistoryCmd *cmd;
+
+ cmd = makeNode(TimeLineHistoryCmd);
+ cmd->timeline = $2;

The invalid timeline error can be handled the same way as for opt_timeline.

26.@@ -170,6 +187,7 @@ WalReceiverMain(void)
+ case WALRCV_WAITING:
+ case WALRCV_STREAMING:
/* Shouldn't happen */
elog(PANIC, "walreceiver still running according to shared memory state");
elog message should be changed according to new states.

27.@@ -259,8 +281,11 @@ WalReceiverMain(void)

  /* Load the libpq-specific functions */
  load_file("libpqwalreceiver", false);
- if (walrcv_connect == NULL || walrcv_receive == NULL ||
- walrcv_send == NULL || walrcv_disconnect == NULL)
+ if (walrcv_connect == NULL || walrcv_startstreaming == NULL ||
+ walrcv_endstreaming == NULL ||
+ walrcv_readtimelinehistoryfile == NULL ||
+ walrcv_receive == NULL || walrcv_send == NULL ||
+ walrcv_disconnect == NULL)

check for walrcv_identify_system is missing.

28.
+/*
+ * Check that we're connected to a valid server using the IDENTIFY_SERVER
+ * replication command, and fetch any timeline history files present in the
+ * master but missing from this server's pg_xlog directory.
+ */
+static void
+WalRcvHandShake(TimeLineID startpointTLI)

In the function header the command name should be IDENTIFY_SYSTEM instead of IDENTIFY_SERVER

29. @@ -170,6 +187,7 @@ WalReceiverMain(void)
+ * timeline. In case we've already reached the end of the old timeline,
+ * the server will finish the streaming immediately, and we will
+ * disconnect. If recovery_target_timeline is 'latest', the startup
+ * process will then pg_xlog and find the new history file, bump recovery

a. I think that after reaching the end of the old timeline, rather than disconnecting, it will start from the new timeline,
which will be updated by the startup process.
b. The above line seems to be incorrect; it should read "will then (scan/search) pg_xlog".

30. +static void
+WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)

/*
+ * nudge startup process to notice that we've stopped streaming and are
+ * now waiting for instructions.
+ */
+ WakeupRecovery();
  for (;;)
  {

In this for loop, don't we need to check for interrupts, whether the postmaster is alive, or whether recovery is still in progress,
so that it does not wait indefinitely?

31.+static void
+WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)

/*
+ * nudge startup process to notice that we've stopped streaming and are
+ * now waiting for instructions.
+ */
+ WakeupRecovery();
  for (;;)
  {
+ SpinLockAcquire(&walrcv->mutex);
+ if (walrcv->walRcvState == WALRCV_STARTING)
  {

I think it can reach WALRCV_STOPPING state also after WALRCV_WAITING from shutdown,
so we should check for that state as well.

32.@@ -170,6 +187,7 @@ WalReceiverMain(void)
{
..
..

+ elog(LOG, "walreceiver ended streaming and awaits new instructions");
+ WalRcvWaitForStartPosition(&startpoint, &startpointTLI);

a. After getting new startpoint and tli, it will go again for WalRcvHandShake(startpointTLI);
so here chances are there, it will again fetch the history files from server which we have
already fetched.
b. Also Identify_system command will run again and get the information such as system identifier,
which is completely redundant at this point.
It fetches tli from primary also which I think can't be changed from what earlier it has fetched.

33. .@@ -170,6 +187,7 @@ WalReceiverMain(void)
+ for (;;)
+ {
+ if (len > 0)
+ XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
+ else if (len == 0)
+ break;
+ else if (len < 0)
+ {
+ ereport(LOG,
+ (errmsg("replication terminated by primary server"),
+ errdetail("End of WAL reached on timeline %u", startpointTLI)));
+ endofwal = true;
+ break;
+ }
+ len = walrcv_receive(0, &buf);
+ }
+
+ /* Let the master know that we received some data. */
+ XLogWalRcvSendReply();
+
+ /*
+ * If we've written some records, flush them to disk and let the
+ * startup process and primary server know about them.
+ */
+ XLogWalRcvFlush(false);

a. In the above for loop, when it breaks due to len < 0, there is no need to send a reply to the master.
b. Also, when it breaks due to len < 0, there can be two reasons: either copy mode has ended or the primary server has
disconnected. I think in the second case the handling should be the same as it was without this feature.
I am not sure whether it eventually turns out to be the same.

34.
+bool
+WalRcvStreaming(void)
+{

In this function, right now, if the state is WALRCV_WAITING it will return false.
I think waiting is better than starting for the purpose of checking whether walreceiver is in progress.
Or is the WALRCV_WAITING state ever expected when this function is called? If not, we could log an
error for the invalid state.

With Regards,
Amit Kapila.

#22Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit kapila (#21)
1 attachment(s)
Re: Switching timeline over streaming replication

Thanks for the thorough review! I committed the xlog.c refactoring patch
now. Attached is a new version of the main patch, comments on specific
points below. I didn't adjust the docs per your comments yet, will do
that next.

On 01.10.2012 05:25, Amit kapila wrote:

1. In function readTimeLineHistory(),
two mechanisms are used to fetch timeline from history file
+                sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi, &switchpoint_lo);
+
+                /* expect a numeric timeline ID as first field of line */
+                tli = (TimeLineID) strtoul(ptr, &endptr, 0);
If we use new mechanism, it will not be able to detect error as it is
doing in current case.

Fixed, by checking the return value of sscanf().
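
As an illustration of that kind of check, here is a minimal sketch that parses one history line with the sscanf format quoted above and rejects lines where not all three fields convert. The function name parse_history_line and the use of plain unsigned int fields are assumptions made for this example; the actual patch code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse one line of the form "<tli>\t<hi>/<lo>".
 * Returns true only if all three fields were converted.
 */
bool
parse_history_line(const char *line, unsigned int *tli,
                   unsigned int *switchpoint_hi,
                   unsigned int *switchpoint_lo)
{
    int nfields;

    assert(line != NULL);

    /* sscanf returns the number of successful conversions (or EOF) */
    nfields = sscanf(line, "%u\t%X/%X", tli, switchpoint_hi, switchpoint_lo);

    return nfields == 3;
}
```

A line such as "2\t0/17E3888" is accepted, while a line containing only a timeline ID or arbitrary text is rejected because fewer than three conversions succeed.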

2.   In function readTimeLineHistory(),
+        fd = AllocateFile(path, "r");
+        if (fd == NULL)
+        {
+                if (errno != ENOENT)
+                        ereport(FATAL,
+                                        (errcode_for_file_access(),
+                                         errmsg("could not open file \"%s\": %m", path)));
+                /* Not there, so assume no parents */
+                return list_make1_int((int) targetTLI);
+        }
still return list_make1_int((int) targetTLI); is used.

Fixed.

3. Function timelineOfPointInHistory(), should return the timeline of recptr
passed to it.
a. is it okay to decide based on xlog recordpointer that which timeline
it belongs to, as different
timelines can have same xlog recordpointer?

In a particular timeline, the history is linear, and a given point in
WAL unambiguously has one timeline ID. There might be some other
timelines that branch off at different points, but once you pick a
particular timeline, you can unambiguously trace it all the way to the
beginning of WAL, and tell what the timeline ID of each point in WAL was.
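
To make the "linear history" argument concrete, here is a small sketch of such a lookup. It is not the patch's timelineOfPointInHistory(): the HistoryEntry struct, the function name, and in particular the boundary convention (a pointer exactly at a switch point is attributed to the newer timeline) are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical history entry: timeline 'tli' covers WAL in [begin, end). */
typedef struct HistoryEntry
{
    uint32_t tli;
    uint64_t begin;  /* where this timeline forked off its parent; 0 for the first */
    uint64_t end;    /* where the next timeline forked off; UINT64_MAX if current */
} HistoryEntry;

/*
 * Return the timeline that WAL position 'ptr' belongs to within one
 * linear history, or 0 if no entry covers it.
 */
uint32_t
timeline_of_point(const HistoryEntry *history, size_t n, uint64_t ptr)
{
    assert(history != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        /* The ranges do not overlap, so at most one entry can match. */
        if (history[i].begin <= ptr && ptr < history[i].end)
            return history[i].tli;
    }
    return 0;
}
```

Because the ranges of one history never overlap, every position maps to exactly one timeline ID, even though a different history (a different branch) could assign the same position to another timeline.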

b. it seems from logic that it will return timeline previous to the
timeline of recptr passed.
For example if the timeline 3's switchpoint is equal to recptr passed
then it will return timeline 2.

I expanded the comment in the function a bit, I hope it makes more sense
now.

4. In writeTimeLineHistory function variable endTLI is never used.

Removed.

5. In header of function writeTimeLineHistory(), can give explanation about
XLogRecPtr switchpoint

Added.

6. @@ -6869,11 +5947,35 @@ StartupXLOG(void)
*/
if (InArchiveRecovery)
{
+                char        reason[200];
+
+                /*
+                 * Write comment to history file to explain why and where timeline
+                 * changed. Comment varies according to the recovery target used.
+                 */
+                if (recoveryTarget == RECOVERY_TARGET_XID)
+                        snprintf(reason, sizeof(reason),
+                                         "%s transaction %u",
+                                         recoveryStopAfter ? "after" : "before",
+                                         recoveryStopXid);

In the comment above this line you mentioned why and where timeline changed.

However in the reason field only specifies about where part.

I didn't change this in the patch. I guess it's not obvious, but you can
deduce the 'why' part from the message.

7. + * Returns the redo pointer of the "previous" checkpoint.
+GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli)
+{
+        if (InRedo)
+        {
+                LWLockAcquire(ControlFileLock, LW_SHARED);
+                *oldrecptr = ControlFile->checkPointCopy.redo;
+                *oldtli = ControlFile->checkPointCopy.ThisTimeLineID;
+                LWLockRelease(ControlFileLock);
+        }

a. In this function, is it required to take ControlFileLock as earlier also
there was no lock to protect this read
when it get called from RestoreArchivedFile, and I think at this point no
one else can modify these values.
However for code consistency purpose like whenever or wherever read the
controlfile values, read it with read lock.

Yeah, it's just for the sake of consistency.

b. As per your comment it should have returned "previous" checkpoint,
however the function returns values of
latest checkpoint.

Changed the comment. I wonder if we should be more conservative, and
really keep the WAL back to the "previous" checkpoint, but I won't
change that as part of this patch.

8. In function writeTimeLineHistoryFile(), will it not be better to directly
write rather than to later do pg_fsync().
as it is just one time write.

Not sure I understood this right, but writeTimeLineHistoryFile() needs
to avoid putting a corrupt, e.g incomplete, file in pg_xlog. The same as
writeTimeLineHistory(). That's why the write+fsync+rename dance is needed.
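
For readers unfamiliar with the pattern, the write+fsync+rename sequence can be sketched with plain POSIX calls as below. This is a generic illustration, not the code in the patch: write_file_atomically and its parameters are hypothetical names, and PostgreSQL itself uses its own file-access wrappers rather than calling these functions directly.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write 'len' bytes to 'path' so that a reader never sees a partial file:
 * write to a temporary name, fsync it, then rename it into place.
 * Returns 0 on success, -1 on failure.
 */
int
write_file_atomically(const char *path, const char *tmppath,
                      const char *data, size_t len)
{
    int     fd;
    size_t  written = 0;

    assert(path != NULL && tmppath != NULL && data != NULL);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (written < len)
    {
        ssize_t n = write(fd, data + written, len - written);

        if (n <= 0)
        {
            close(fd);
            unlink(tmppath);
            return -1;
        }
        written += (size_t) n;
    }

    /* Make sure the contents are on disk before the file becomes visible. */
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* rename() replaces the target in one step on POSIX file systems. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}
```

If the process dies before rename(), only the temporary file is left behind and the final name is either absent or still holds the old complete contents. A more careful version would probably also fsync the containing directory; that step is left out of this sketch.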

9. +XLogRecPtr
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli)
..
..
+                         * starting point. This is because the client can legimately
spelling of legitimately needs to be corrected.

Fixed.

10.+XLogRecPtr
+timeLineSwitchPoint(XLogRecPtr startpoint, TimeLineID tli)
..
..
+ if (tli < ThisTimeLineID)
+        {
+                if (!nexttle)
+                        elog(ERROR, "could not find history entry for child of timeline %u", tli); /* shouldn't happen */
+        }

I don't understand the meaning of the above check, as I think this
situation can occur when this function gets called from StartReplication,
because always tli sent by standby to new master will be less than
ThisTimeLineID and it can be first in list.

Note that the list is in newest-first order. Ie. the last line in the
history file is first in the list. The first entry in the list is always
for ThisTimeLineID, which is why the (tli < ThisTimeLineID && !nexttle)
combination isn't possible.
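
The effect of the newest-first ordering can be shown with a short sketch. The names here (TimelineEntry, switch_point_of) are hypothetical and the logic is a simplified stand-in for timeLineSwitchPoint(), assuming each entry records where its timeline forked off its parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: timeline 'tli' forked off its parent at 'begin'. */
typedef struct TimelineEntry
{
    uint32_t tli;
    uint64_t begin;
} TimelineEntry;

/*
 * Walk a newest-first history and report where the history switched away
 * from 'tli'. Sets *switchpoint to 0 if 'tli' is the newest timeline.
 * Returns false if 'tli' is not part of the history at all.
 */
bool
switch_point_of(const TimelineEntry *history, size_t n,
                uint32_t tli, uint64_t *switchpoint)
{
    const TimelineEntry *child = NULL;  /* entry seen just before: the newer one */

    assert(switchpoint != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (history[i].tli == tli)
        {
            /* Only the first (newest) entry can be reached with no child. */
            *switchpoint = child ? child->begin : 0;
            return true;
        }
        child = &history[i];
    }
    return false;
}
```

Since the newest timeline is always the first element, any older timeline is reached only after at least one iteration, so its child entry is guaranteed to be set by the time it is found.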

11. In function readTimeLineHistory()
ereport(DEBUG3,
(errmsg_internal("history of timeline %u is %s",
targetTLI, nodeToString(result))));

Don't nodeToString(result) needs to be changed as it contain list of structure TimeLineHistoryEntry

Yep. Since this is just a DEBUG3, I'll just remove that, rather than add
the extra code needed to keep the output.

12. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
+ * The current timeline must be found in the history file, and the
+ * next timeline must've forked off from it *after* the current
+ * recovery location.
*/
- if (!list_member_int(newExpectedTLIs,
- (int) recoveryTargetTLI))
- ereport(LOG,
- (errmsg("new timeline %u is not a child of database system timeline %u",
- newtarget,
- ThisTimeLineID)));

is there any logic in the current patch which ensures that above check is not require now?

13. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
+ found = false;
+ foreach (cell, newExpectedTLIs)
..
..
- list_free(expectedTLIs);
+ list_free_deep(expectedTLIs);
whats the use of the found variable and freeing expectedTLIs in loop might cause problem.

Oops, there's some code missing there. Apparently I botched that at some
point while splitting the patch into two. Fixed.

14. In function @@ -3768,6 +3773,8 @@ rescanLatestTimeLine(void)
newExpectedTLIs = readTimeLineHistory(newtarget);
Shouldn't this variable be declared as newExpectedTLEs as the list returned by readTimeLineHistory contains target list entry

15. StartupXLOG
/* Now we can determine the list of expected TLIs */
expectedTLIs = readTimeLineHistory(recoveryTargetTLI);

Should expectedTLIs be changed to expectedTLEs as the list returned by readTimeLineHistory contains target list entry

Makes sense, renamed these two.

16.@@ -5254,8 +5252,8 @@ StartupXLOG(void)
writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
- curFileTLI, endLogSegNo, reason);
+ curFileTLI, EndRecPtr, reason);
If endLogSegNo is not used here, it needs to be removed from the function declaration as well.

I didn't understand this one. endLogSegNo is still used earlier in
StartupXLOG.

17.@@ -5254,8 +5252,8 @@ StartupXLOG(void)
if (InArchiveRecovery)
..
..
+
+ /* The history file can be archived immediately. */
+ TLHistoryFileName(histfname, ThisTimeLineID);
+ XLogArchiveNotify(histfname);

Shouldn't this be done only when archive_mode is on? Right now InArchiveRecovery is true even when we do recovery for a standby.

Fixed.

18. +static bool
+WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt)
{
..
+ if (XLByteLT(RecPtr, receivedUpto))
+ havedata = true;
+ else
+ {
+ XLogRecPtr latestChunkStart;
+
+ receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart,&receiveTLI);
+ if (XLByteLT(RecPtr, receivedUpto)&&  receiveTLI == curFileTLI)
+ {
+ havedata = true;
+ if (!XLByteLT(RecPtr, latestChunkStart))
+ {
+ XLogReceiptTime = GetCurrentTimestamp();
+ SetCurrentChunkStartTime(XLogReceiptTime);
+ }
+ }
+ else
+ havedata = false;
+ }

In the above logic, there seems to be an inconsistency in setting havedata = true.
In the else branch, suppose receiveTLI == curFileTLI is false but XLByteLT(RecPtr, receivedUpto) is true.
Then in the next iteration of the for loop, the check if (XLByteLT(RecPtr, receivedUpto)) will be true and will set havedata = true,
which seems contradictory.

Hmm, I think you're saying that we should check that receiveTLI ==
curFileTLI also in the first if-statement above. Did that.

19.

+static bool
+WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool fetching_ckpt)
{
..
+ if (PrimaryConnInfo)
+ {
+ XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
+ TimeLineID tli = timelineOfPointInHistory(ptr, expectedTLIs);
+
+ if (tli<  curFileTLI)

I think in some cases if (tli < curFileTLI) might not make sense, for example when curFileTLI = 0 for randAccess.

Well, if curFileTLI == 0, then it's surely < tli. I think that's the
correct behavior, but I'll add an explicit check for randAccess to make
it more explicit.

20. Function name WaitForWALToBecomeAvailable suggests that it waits for WAL, but it also returns true when the trigger file is present,
which can be a little misleading.

Added a comment above the function to clarify that.

21. @@ -2411,27 +2411,6 @@ reaper(SIGNAL_ARGS)

a. won't it impact stop of online basebackup functionality because earlier on promotion
I think this code will stop walsenders and basebackup stop will also give error in such cases.

Hmm, should a base backup be aborted when the standby is promoted? Does
the promotion render the backup corrupt?

22. @@ -63,10 +66,17 @@ void
_PG_init(void)
{
/* Tell walreceiver how to reach us */
- if (walrcv_connect != NULL || walrcv_receive != NULL ||
- walrcv_send != NULL || walrcv_disconnect != NULL)
+ if (walrcv_connect != NULL || walrcv_identify_system ||
+ walrcv_readtimelinehistoryfile != NULL ||

check for walrcv_identify_system != NULL is missing.

Fixed.

23. write the function header for newly added functions (libpqrcv_startstreaming, libpqrcv_identify_system, ..)

Fixed.

24. In header of function libpqrcv_receive(), *type needs to be removed.
+ * If data was received, returns the length of the data. *type and *buffer

Fixed.

25.
+timeline_history:
+ K_TIMELINE_HISTORY ICONST
+ {
+ TimeLineHistoryCmd *cmd;
+
+ cmd = makeNode(TimeLineHistoryCmd);
+ cmd->timeline = $2;

The invalid timeline error can be handled the same way as for opt_timeline.

Fixed.

26.@@ -170,6 +187,7 @@ WalReceiverMain(void)
+ case WALRCV_WAITING:
+ case WALRCV_STREAMING:
/* Shouldn't happen */
elog(PANIC, "walreceiver still running according to shared memory state");
The elog message should be changed to reflect the new states.

I think it's ok as is. Both 'waiting' and 'streaming' can be thought of
as 'running'. WalRcvRunning() also returns true for both states.

27.@@ -259,8 +281,11 @@ WalReceiverMain(void)

/* Load the libpq-specific functions */
load_file("libpqwalreceiver", false);
- if (walrcv_connect == NULL || walrcv_receive == NULL ||
- walrcv_send == NULL || walrcv_disconnect == NULL)
+ if (walrcv_connect == NULL || walrcv_startstreaming == NULL ||
+ walrcv_endstreaming == NULL ||
+ walrcv_readtimelinehistoryfile == NULL ||
+ walrcv_receive == NULL || walrcv_send == NULL ||
+ walrcv_disconnect == NULL)

check for walrcv_identify_system is missing.

Fixed.

28.
+/*
+ * Check that we're connected to a valid server using the IDENTIFY_SERVER
+ * replication command, and fetch any timeline history files present in the
+ * master but missing from this server's pg_xlog directory.
+ */
+static void
+WalRcvHandShake(TimeLineID startpointTLI)

In the function header the command name should be IDENTIFY_SYSTEM instead of IDENTIFY_SERVER

Fixed.

29. @@ -170,6 +187,7 @@ WalReceiverMain(void)
+ * timeline. In case we've already reached the end of the old timeline,
+ * the server will finish the streaming immediately, and we will
+ * disconnect. If recovery_target_timeline is 'latest', the startup
+ * process will then pg_xlog and find the new history file, bump recovery

a. I think that after reaching the end of the old timeline, rather than disconnecting, it will start from the new timeline,
which will be updated by the startup process.

b. The above line seems to be incorrect, "will then (scan/search) pg_xlog"

Fixed the comment.

30. +static void
+WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)

/*
+ * nudge startup process to notice that we've stopped streaming and are
+ * now waiting for instructions.
+ */
+ WakeupRecovery();
for (;;)
{

In this for loop, don't we need to check for interrupts, postmaster alive, or recovery in progress,
so that if any other process continues, it does not wait indefinitely?

Added a ProcessWalRcvInterrupts() check. There is a PostmasterIsAlive()
check there already.

31.+static void
+WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)

/*
+ * nudge startup process to notice that we've stopped streaming and are
+ * now waiting for instructions.
+ */
+ WakeupRecovery();
for (;;)
{
+ SpinLockAcquire(&walrcv->mutex);
+ if (walrcv->walRcvState == WALRCV_STARTING)
{

I think it can reach WALRCV_STOPPING state also after WALRCV_WAITING from shutdown,
so we should check for that state as well.

Added that, although we should receive the SIGTERM when the startup
process wants the walreceiver to die.

32.@@ -170,6 +187,7 @@ WalReceiverMain(void)
{
..
..

+ elog(LOG, "walreceiver ended streaming and awaits new instructions");
+ WalRcvWaitForStartPosition(&startpoint,&startpointTLI);

a. After getting the new startpoint and tli, it will call WalRcvHandShake(startpointTLI) again,
so there is a chance that it will again fetch history files from the server that we have
already fetched.

WalRcvHandShake() only fetches history files that don't exist in pg_xlog
already.

b. Also Identify_system command will run again and get the information such as system identifier,
which is completely redundant at this point.
It fetches tli from primary also which I think can't be changed from what earlier it has fetched.

The server's tli might've changed in cascading replication, where the
server is also a standby running with recovery_target_timeline='latest'.
It's fairly unlikely if we just ran the Identify_system command, but I'm
inclined to keep that. One extra roundtrip isn't that bad, and I think
it'd complicate the logic to try to avoid that.

33. .@@ -170,6 +187,7 @@ WalReceiverMain(void)
+ for (;;)
+ {
+ if (len>  0)
+ XLogWalRcvProcessMsg(buf[0],&buf[1], len - 1);
+ else if (len == 0)
+ break;
+ else if (len<  0)
+ {
+ ereport(LOG,
+ (errmsg("replication terminated by primary server"),
+ errdetail("End of WAL reached on timeline %u", startpointTLI)));
+ endofwal = true;
+ break;
+ }
+ len = walrcv_receive(0,&buf);
+ }
+
+ /* Let the master know that we received some data. */
+ XLogWalRcvSendReply();
+
+ /*
+ * If we've written some records, flush them to disk and let the
+ * startup process and primary server know about them.
+ */
+ XLogWalRcvFlush(false);

a. In the above code in for loop, when it breaks due to len< 0, there is no need to send reply to master.

Well, I think it's prudent to send one more reply at the end of streaming.

b. Also, when it breaks due to len < 0, there can be two reasons: either the end of copy mode, or the primary server has
disconnected. I think in the second case the handling should be the same as without this feature.
Not sure if it eventually turns out to be the same.

No, libpqrcv_receive() will throw an error if the connection is broken for
some reason. It only returns -1 at end-of-copy.

34.
+bool
+WalRcvStreaming(void)
+{

In this function, right now if the state is WALRCV_WAITING, it will return false.
I think waiting is a better indication than starting when checking whether walreceiver is in progress.
Or is the WALRCV_WAITING state ever expected when this function is called? If not, then we should log an
error for the invalid state.

It's normal to call WalRcvStreaming() when it's in waiting mode.
WalRcvStreaming() is called from WaitForWALToBecomeAvailable, always,
regardless of whether streaming replication is even enabled.

Thanks again for the detailed review!

- Heikki

Attachments:

streaming-tli-switch-3.patch.gz (application/x-gzip)
#23Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#22)
Re: Switching timeline over streaming replication

On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Thanks for the thorough review! I committed the xlog.c refactoring patch
now. Attached is a new version of the main patch, comments on specific
points below. I didn't adjust the docs per your comments yet, will do
that next.

I have some doubts regarding the comments fixed by you and some more new
review comments.
After this I shall focus mainly on testing this patch.

On 01.10.2012 05:25, Amit kapila wrote:

1. In function readTimeLineHistory(),
two mechanisms are used to fetch timeline from history file
+                sscanf(fline, "%u\t%X/%X", &tli, &switchpoint_hi,
&switchpoint_lo);
+
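For reference, the sscanf format quoted above reads one history-file line as a timeline ID followed by a switchpoint written as two hexadecimal halves. A minimal standalone sketch of that parsing step follows; HistoryLine and parse_history_line() are hypothetical names used only for illustration, and the sample value is borrowed from a log line later in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for one parsed line of a timeline history file. */
typedef struct
{
    uint32_t tli;        /* timeline ID */
    uint32_t switch_hi;  /* high 32 bits of the switchpoint */
    uint32_t switch_lo;  /* low 32 bits of the switchpoint */
} HistoryLine;

/*
 * Parse "<tli><whitespace><hi>/<lo>", with hi and lo in hexadecimal.
 * Returns 1 on success, 0 if the line does not contain all three fields.
 */
static int
parse_history_line(const char *fline, HistoryLine *out)
{
    unsigned int tli, hi, lo;

    assert(fline != NULL && out != NULL);

    /* sscanf returns the number of fields it managed to convert. */
    if (sscanf(fline, "%u\t%X/%X", &tli, &hi, &lo) != 3)
        return 0;

    out->tli = tli;
    out->switch_hi = hi;
    out->switch_lo = lo;
    return 1;
}
```

For example, the line "2\t0/176BD68" would yield timeline 2 with switchpoint halves 0x0 and 0x176BD68, while a line with fewer than three convertible fields is rejected.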

8. In function writeTimeLineHistoryFile(), will it not be better to
directly write rather than to later do pg_fsync().
as it is just one time write.

Not sure I understood this right, but writeTimeLineHistoryFile() needs
to avoid putting a corrupt, e.g incomplete, file in pg_xlog. The same as
writeTimeLineHistory(). That's why the write+fsync+rename dance is
needed.

Why is fsync needed? Isn't the above purpose already served if we write directly to
the file and then rename it?
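On the general technique being asked about: rename() makes the new name appear atomically, but by itself it does not guarantee that the file's data has reached disk. Without the fsync, a crash shortly after the rename could leave a file with the final name but incomplete contents. Below is a minimal POSIX sketch of the write+fsync+rename pattern. It is not the patch's writeTimeLineHistoryFile(); write_file_durably() and its arguments are hypothetical, and a fully careful version would also fsync the containing directory, which is omitted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write "contents" to "path" so that the final name never refers to a
 * partially written file: write to a temporary name, fsync, then rename.
 * Returns 0 on success, -1 on failure (the temporary file is removed).
 */
static int
write_file_durably(const char *path, const char *tmppath, const char *contents)
{
    size_t len;
    int    fd;

    assert(path != NULL && tmppath != NULL && contents != NULL);
    len = strlen(contents);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Force the data to disk before the file becomes visible under "path". */
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* Atomically replace or create the final name. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}
```

A reader that only ever opens the final name sees either no file or the complete one, never a half-written file.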

21. @@ -2411,27 +2411,6 @@ reaper(SIGNAL_ARGS)

a. won't it impact stop of online basebackup functionality because earlier on promotion
I think this code will stop walsenders and basebackup stop will also give error in such cases.

Hmm, should a base backup be aborted when the standby is promoted? Does
the promotion render the backup corrupt?

I think currently it does so. Pls refer
1.
do_pg_stop_backup(char *labelfile, bool waitforarchive)
{
..
    if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("the standby was promoted during online backup"),
                 errhint("This means that the backup being taken is corrupt "
                         "and should not be used. "
                         "Try taking another online backup.")));
..
}

2. In documentation of pg_basebackup there is a Note:
If the standby is promoted to the master during online backup, the backup fails.

New Ones
---------------
35.WalSenderMain(void) 
{ 
.. 
+                if (walsender_shutdown_requested) 
+                        ereport(FATAL, 
+                                        (errcode(ERRCODE_ADMIN_SHUTDOWN), 
+                                         errmsg("terminating replication connection due to administrator command")));
+ 
+                /* Tell the client that we are ready to receive commands */
+                ReadyForQuery(DestRemote); 
+ 
.. 
+                if (walsender_shutdown_requested) 
+                        ereport(FATAL, 
+                                        (errcode(ERRCODE_ADMIN_SHUTDOWN), 
+                                         errmsg("terminating replication connection due to administrator command")));
+ 

Is it necessary to check walsender_shutdown_requested twice in the loop? If
yes, then can we add a comment explaining why it is important to check it again?

35. +SendTimeLineHistory(TimeLineHistoryCmd *cmd)
{
..
+ fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY, 0666);

error handling for fd < 0 is missing.

 36.+SendTimeLineHistory(TimeLineHistoryCmd *cmd) 
 { 
 .. 
 if (nread <= 0) 
+                        ereport(ERROR, 
+                                        (errcode_for_file_access(), 
+                                         errmsg("could not read file \"%s\": %m",
+                                                        path))); 

FileClose should be done in error case as well.
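The two points above (check the descriptor right after opening, and close it on the read-error path too) can be illustrated with plain POSIX calls. This is a hedged sketch, not the walsender code: it uses open()/read()/close() rather than PostgreSQL's PathNameOpenFile()/FileClose(), and read_first_chunk() is a hypothetical helper that returns an error code instead of calling ereport().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to buflen bytes from the start of "path" into buf.
 * Returns the number of bytes read, or -1 if the file cannot be opened,
 * cannot be read, or is empty. The descriptor is closed on every path.
 */
static long
read_first_chunk(const char *path, char *buf, size_t buflen)
{
    int     fd;
    ssize_t nread;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;              /* open failed: nothing to clean up */

    nread = read(fd, buf, buflen);
    if (nread <= 0)
    {
        close(fd);              /* release the descriptor before reporting */
        return -1;
    }

    close(fd);
    return (long) nread;
}
```

In code that reports errors by jumping out of the function, as ereport(ERROR) does, the close has to happen before the error is raised or in a cleanup handler; the early-return structure above is only the simplest way to show the ordering.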

37. static void
XLogSend(char *msgbuf, bool *caughtup)
{
..
    if (currentTimeLineIsHistoric && XLByteLE(currentTimeLineValidUpto, sentPtr))
    {
        /*
         * This was a historic timeline, and we've reached the point where
         * we switched to the next timeline. Let the client know we've
         * reached the end of this timeline, and what the next timeline is.
         */
        /* close the current file. */
        if (sendFile >= 0)
            close(sendFile);
        sendFile = -1;
        *caughtup = true;

        /* Send CopyDone */
        pq_putmessage_noblock('c', NULL, 0);
        streamingDoneSending = true;
        return;
    }
}

I am not able to understand how, after the CopyDone message is sent from the above code,
walreceiver handles it and replies with a CopyDone message of its own.
Basically, I want to know which walreceiver code handles it.
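This does not show the walreceiver code being asked about. It only models the CopyDone exchange as described in the "Protocol changes" part of the first message in this thread: whichever side sends CopyDone first moves the connection from Copy-both to a one-directional mode, and the second CopyDone ends copy mode. CopyMode and the two transition functions are hypothetical names, and leaving the mode unchanged on a repeated CopyDone is an assumption of this sketch (a real implementation would treat it as a protocol violation).

```c
#include <assert.h>

/* Connection modes as seen from the protocol description. */
typedef enum
{
    COPY_BOTH,   /* both ends may send CopyData */
    COPY_IN,     /* server has sent CopyDone; only the client still sends */
    COPY_OUT,    /* client has sent CopyDone; only the server still sends */
    COPY_NONE    /* both CopyDone messages exchanged; back to normal */
} CopyMode;

/* Transition taken when the server sends CopyDone. */
static CopyMode
server_sent_copydone(CopyMode mode)
{
    switch (mode)
    {
        case COPY_BOTH:
            return COPY_IN;
        case COPY_OUT:
            return COPY_NONE;
        default:
            return mode;    /* repeated CopyDone: unchanged in this sketch */
    }
}

/* Transition taken when the client sends CopyDone. */
static CopyMode
client_sent_copydone(CopyMode mode)
{
    switch (mode)
    {
        case COPY_BOTH:
            return COPY_OUT;
        case COPY_IN:
            return COPY_NONE;
        default:
            return mode;    /* repeated CopyDone: unchanged in this sketch */
    }
}
```

In the XLogSend() case quoted above, the server takes the first transition (COPY_BOTH to COPY_IN) when it reaches the end of a historic timeline; the connection leaves copy mode once the client answers with its own CopyDone.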

38.
static void
WalSndLoop(void)
{
@@ -756,18 +898,24 @@ WalSndLoop(void)

                 /* Normal exit from the walsender is here */ 
                 if (walsender_shutdown_requested) 
-                { 
-                        /* Inform the standby that XLOG streaming is done */
-                        pq_puttextmessage('C', "COPY 0"); 
-                        pq_flush(); 
- 
-                        proc_exit(0); 
-                } 
+                        ereport(FATAL, 
+                                        (errcode(ERRCODE_ADMIN_SHUTDOWN), 
+                                         errmsg("terminating replication connection due to administrator command")));

What is the reason for no longer sending the above message to the standby when
shutdown is requested?

39. WalSndLoop(void)
{
..
/* If nothing remains to be sent right now ... */
if (caughtup && !pq_is_send_pending())
{
    /*
     * If we're in catchup state, move to streaming. This is an
     * important state change for users to know about, since before
     * this point data loss might occur if the primary dies and we
     * need to failover to the standby. The state change is also
     * important for synchronous replication, since commits that
     * started to wait at that point might wait for some time.
     */
    if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
    {
        ereport(DEBUG1,
                (errmsg("standby \"%s\" has now caught up with primary",
                        application_name)));
        WalSndSetState(WALSNDSTATE_STREAMING);
    }
..
}

With the new implementation, I think the above if condition [if (caughtup &&
!pq_is_send_pending())] can be true
when the standby has not actually caught up, as it now sends data from
previous timelines.

With Regards,
Amit Kapila.

#24Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#23)
Promoting a standby during base backup (was Re: Switching timeline over streaming replication)

On 03.10.2012 18:15, Amit Kapila wrote:

On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Hmm, should a base backup be aborted when the standby is promoted? Does
the promotion render the backup corrupt?

I think currently it does so. Pls refer
1.
do_pg_stop_backup(char *labelfile, bool waitforarchive)
{
..
    if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("the standby was promoted during online backup"),
                 errhint("This means that the backup being taken is corrupt "
                         "and should not be used. "
                         "Try taking another online backup.")));
..
}

Okay. I think that check in do_pg_stop_backup() actually already ensures
that you don't end up with a corrupt backup, even if the standby is
promoted while a backup is being taken. Admittedly it would be nicer to
abort it immediately rather than error out at the end.

But I wonder why promoting a standby renders the backup invalid in the
first place? Fujii, Simon, can you explain that?

- Heikki

#25Amit Kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#23)
Re: Switching timeline over streaming replication

On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote:
On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Thanks for the thorough review! I committed the xlog.c refactoring patch
now. Attached is a new version of the main patch, comments on specific
points below. I didn't adjust the docs per your comments yet, will do
that next.

I have some doubts regarding the comments fixed by you and some more new
review comments.
After this I shall focus mainly on testing this patch.

Testing
-----------

Failed Case
--------------
1. promotion of standby to master and follow standby to new master.
2. Stop standby and master. Restart standby first and then master
3. Restart of standby gives below errors
E:\pg_git_code\installation\bin>LOG: database system was shut down in recovery at 2012-10-04 18:36:00 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176B800
LOG: redo starts at 0/176B800
LOG: record with zero length at 0/176BD68
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
FATAL: terminating walreceiver process due to administrator command
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0

Once this error occurs, restarting the master and standby in any order, or doing some
operations on the master, always produces the above error
on the standby.

Passed Cases
-------------
1. After promoting standby as new master, try to make previous master
(having same WAL as new master) as standby.
In this case recovery_target_timeline in recovery.conf is set to latest. It
is able to connect to the new master and starts
streaming as expected.
- As per expected behavior.

2. After promoting standby as new master, try to make previous master
(having more WAL compared to new master) as standby,
error is displayed.
- As per expected behavior

3. After promoting standby as new master, try to make previous master
(having same WAL as new master) as standby.
In this case recovery_target_timeline is not set in recovery.conf. The following
LOG is displayed.
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
- As per expected behavior

Pending Cases which needs to be tested (these are scenarios, some more
testing I will do based on these scenarios)
---------------------------------------
1. a. Master M-1
b. Standby S-1 follows M-1
c. Standby S-2 follows M-1
d. Promote S-1 as master
e. Try to follow S-2 to S-1 -- operation should be success

2. a. Master M-1
b. Standby S-1 follows M-1
c. Stop S-1, M-1
d. Do the PITR in M-1 2 times. This is to increment timeline in M-1
e. try to follow standby S-1 to M-1 -- it should be success.

3. a. Master M-1
b. Standby S-1, S-2 follows M1
c. Standby S-3, S-4 follows S-1
d. Promote Standby which has highest WAL.
e. follow all standby's to the new master.

4. a. Master M-1
b. Synchronous Standby S-1, S-2
c. Promote S-1
d. Follow M-1, S-2 to S-1 -- this operation should be success.

Concurrent Operations
---------------------------
1. a. Master M-1 , Standby S-1 follows M-1, Standby S-2 follows M-1
b. Many concurrent operations on master M-1
c. During concurrent ops, Promote S-1
d. try S-2 to follow S-1 -- it should happen successfully.

2. During Promotion, call pg_basebackup

3. During Promotion, try to connect client

Resource Testing
------------------
1. a.Make standby follow master which is many time lines ahead
b. Observe if there is any resource leak
c. Allow the streaming replication for 30 mins
d. Observe if there is any resource leak

Code Review
-------------
Libpqrcv_readtimelinehistoryfile()
{
  ..
  ..
+       if (PQnfields(res) != 2 || PQntuples(res) != 1) 
+       { 
+               int                     ntuples = PQntuples(res); 
+               int                     nfields = PQnfields(res); 
+ 
+               PQclear(res); 
+               ereport(ERROR, 
+                               (errmsg("invalid response from primary server"),
+                                errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+                                                  ntuples, nfields))); 
+       }

..
}

The error message says that 3 fields are expected in the timeline history response,
but the check is done for 2 fields.
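One way to keep a check and its message from drifting apart like this is to derive both from the same constants. The sketch below assumes the check (2 fields, 1 tuple) is the correct side, which the thread does not confirm; EXPECTED_FIELDS, EXPECTED_TUPLES and check_result_shape() are hypothetical names, and the error is written into a caller-supplied buffer instead of going through ereport().

```c
#include <assert.h>
#include <stdio.h>

/* Assumed expected shape of the result; used by both check and message. */
#define EXPECTED_TUPLES 1
#define EXPECTED_FIELDS 2

/*
 * Returns 0 if the result has the expected shape. Otherwise returns -1 and
 * writes a detail message, built from the same constants, into errbuf.
 */
static int
check_result_shape(int ntuples, int nfields, char *errbuf, size_t errlen)
{
    assert(errbuf != NULL && errlen > 0);

    if (nfields != EXPECTED_FIELDS || ntuples != EXPECTED_TUPLES)
    {
        snprintf(errbuf, errlen,
                 "Expected %d tuple with %d fields, got %d tuples with %d fields.",
                 EXPECTED_TUPLES, EXPECTED_FIELDS, ntuples, nfields);
        return -1;
    }

    errbuf[0] = '\0';
    return 0;
}
```

If the expected number of fields later changes, only the constant needs updating and the message follows automatically.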

Kindly let me know if you want me to focus on any other areas for testing
this feature.

With Regards,
Amit Kapila.

#26Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#23)
1 attachment(s)
Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

On 03.10.2012 18:15, Amit Kapila wrote:

35.WalSenderMain(void)
{
..
+                if (walsender_shutdown_requested)
+                        ereport(FATAL,
+                                        (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                                         errmsg("terminating replication connection due to administrator command")));
+
+                /* Tell the client that we are ready to receive commands */
+                ReadyForQuery(DestRemote);
+
..
+                if (walsender_shutdown_requested)
+                        ereport(FATAL,
+                                        (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                                         errmsg("terminating replication connection due to administrator command")));
+

is it necessary to check walsender_shutdown_requested 2 times in a
loop, if yes, then can we write comment why it is important to check
it again.

The idea was to check for shutdown request before and after the
pq_getbyte() call, because that can block for a long time.
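The "check before and after a blocking call" idea can be shown in isolation. This is a sketch under stated assumptions, not the walsender code: shutdown_requested plays the role of walsender_shutdown_requested, the blocking read is passed in as a function pointer where the real code calls pq_getbyte(), and the two fake_* readers exist only so the behaviour can be exercised without a socket or real signals.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set from a signal handler in real code; plain assignments in this sketch. */
static volatile sig_atomic_t shutdown_requested = 0;

#define READ_SHUTDOWN (-2)      /* hypothetical "shutdown seen" result */

typedef int (*blocking_read_fn) (void *arg);

/*
 * Read one command byte, checking for a shutdown request both before the
 * potentially long blocking read and again after it returns.
 */
static int
read_command_byte(blocking_read_fn read_fn, void *arg)
{
    int c;

    assert(read_fn != NULL);

    if (shutdown_requested)
        return READ_SHUTDOWN;   /* do not even start waiting */

    c = read_fn(arg);           /* may block for a long time */

    if (shutdown_requested)
        return READ_SHUTDOWN;   /* request arrived while we were blocked */

    return c;
}

/* Stand-in reader: a command byte arrives normally. */
static int
fake_read(void *arg)
{
    (void) arg;
    return 'Q';
}

/* Stand-in reader: a shutdown request arrives during the wait. */
static int
fake_read_interrupted(void *arg)
{
    (void) arg;
    shutdown_requested = 1;
    return 'Q';
}
```

Note that this still does not react until the blocking read returns, which is the limitation the rest of this message goes on to address.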

Looking closer, we don't currently (ie. without this patch) make any
effort to react to SIGTERM in a walsender, while it's waiting for a
command from the client. After starting replication, it does check
walsender_shutdown_requested in the loop, and it's also checked during a
base backup (although only when switching to send next file, which seems
too seldom). This issue is orthogonal to handling timeline changes over
streaming replication, although that patch will make it more important
to handle SIGTERM quickly while waiting for a command, because you stay
in that mode for longer and more often.

I think walsender needs to share more infrastructure with regular
backends to handle this better. When we first implemented streaming
replication in 9.0, it made sense to implement just the bare minimum
needed to accept the handshake commands before entering the Copy state,
but now that the replication command set has grown to cover base
backups, and fetching timelines with the patch being discussed, we
should bite the bullet and make the command loop more feature-complete
and robust.

In a regular backend, the command loop reacts to SIGTERM immediately,
setting ImmediateInterruptOK at the right places, and calling
CHECK_FOR_INTERRUPTS() at strategic places. I propose that we let
PostgresMain handle the main command loop for walsender processes too,
like it does for regular backends, and use ProcDiePending and the
regular die() signal handler for SIGTERM in walsender as well.

So I propose the attached patch. I made small changes to postgres.c to
make it call exec_replication_command() instead of exec_simple_query(),
and reject the extended query protocol, in a WAL sender process. A lot of code
related to handling the main command loop and signals is removed from
walsender.c.

- Heikki

Attachments:

use-main-command-loop-in-walsender-1.patch (text/x-diff)
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 22,27 ****
--- 22,28 ----
  #include "lib/stringinfo.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
+ #include "miscadmin.h"
  #include "nodes/pg_list.h"
  #include "replication/basebackup.h"
  #include "replication/walsender.h"
***************
*** 30,36 ****
  #include "storage/ipc.h"
  #include "utils/builtins.h"
  #include "utils/elog.h"
- #include "utils/memutils.h"
  #include "utils/ps_status.h"
  
  typedef struct
--- 31,36 ----
***************
*** 370,388 **** void
  SendBaseBackup(BaseBackupCmd *cmd)
  {
  	DIR		   *dir;
- 	MemoryContext backup_context;
- 	MemoryContext old_context;
  	basebackup_options opt;
  
  	parse_basebackup_options(cmd->options, &opt);
  
- 	backup_context = AllocSetContextCreate(CurrentMemoryContext,
- 										   "Streaming base backup context",
- 										   ALLOCSET_DEFAULT_MINSIZE,
- 										   ALLOCSET_DEFAULT_INITSIZE,
- 										   ALLOCSET_DEFAULT_MAXSIZE);
- 	old_context = MemoryContextSwitchTo(backup_context);
- 
  	WalSndSetState(WALSNDSTATE_BACKUP);
  
  	if (update_process_title)
--- 370,379 ----
***************
*** 403,411 **** SendBaseBackup(BaseBackupCmd *cmd)
  	perform_base_backup(&opt, dir);
  
  	FreeDir(dir);
- 
- 	MemoryContextSwitchTo(old_context);
- 	MemoryContextDelete(backup_context);
  }
  
  static void
--- 394,399 ----
***************
*** 606,612 **** sendDir(char *path, int basepathlen, bool sizeonly)
  		 * error in that case. The error handler further up will call
  		 * do_pg_abort_backup() for us.
  		 */
! 		if (walsender_shutdown_requested || walsender_ready_to_stop)
  			ereport(ERROR,
  				(errmsg("shutdown requested, aborting active base backup")));
  
--- 594,600 ----
  		 * error in that case. The error handler further up will call
  		 * do_pg_abort_backup() for us.
  		 */
! 		if (ProcDiePending || walsender_ready_to_stop)
  			ereport(ERROR,
  				(errmsg("shutdown requested, aborting active base backup")));
  
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 113,133 **** static TimestampTz last_reply_timestamp;
  
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
- volatile sig_atomic_t walsender_shutdown_requested = false;
  volatile sig_atomic_t walsender_ready_to_stop = false;
  
  /* Signal handlers */
  static void WalSndSigHupHandler(SIGNAL_ARGS);
- static void WalSndShutdownHandler(SIGNAL_ARGS);
- static void WalSndQuickDieHandler(SIGNAL_ARGS);
  static void WalSndXLogSendHandler(SIGNAL_ARGS);
  static void WalSndLastCycleHandler(SIGNAL_ARGS);
  
  /* Prototypes for private functions */
- static bool HandleReplicationCommand(const char *cmd_string);
  static void WalSndLoop(void) __attribute__((noreturn));
! static void InitWalSnd(void);
! static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogSend(char *msgbuf, bool *caughtup);
  static void IdentifySystem(void);
--- 113,128 ----
  
  /* Flags set by signal handlers for later service in main loop */
  static volatile sig_atomic_t got_SIGHUP = false;
  volatile sig_atomic_t walsender_ready_to_stop = false;
  
  /* Signal handlers */
  static void WalSndSigHupHandler(SIGNAL_ARGS);
  static void WalSndXLogSendHandler(SIGNAL_ARGS);
  static void WalSndLastCycleHandler(SIGNAL_ARGS);
  
  /* Prototypes for private functions */
  static void WalSndLoop(void) __attribute__((noreturn));
! static void InitWalSenderSlot(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogSend(char *msgbuf, bool *caughtup);
  static void IdentifySystem(void);
***************
*** 139,284 **** static void ProcessRepliesIfAny(void);
  static void WalSndKeepalive(char *msgbuf);
  
  
! /* Main entry point for walsender process */
  void
! WalSenderMain(void)
  {
- 	MemoryContext walsnd_context;
- 
  	am_cascading_walsender = RecoveryInProgress();
  
  	/* Create a per-walsender data structure in shared memory */
! 	InitWalSnd();
! 
! 	/*
! 	 * Create a memory context that we will do all our work in.  We do this so
! 	 * that we can reset the context during error recovery and thereby avoid
! 	 * possible memory leaks.  Formerly this code just ran in
! 	 * TopMemoryContext, but resetting that would be a really bad idea.
! 	 *
! 	 * XXX: we don't actually attempt error recovery in walsender, we just
! 	 * close the connection and exit.
! 	 */
! 	walsnd_context = AllocSetContextCreate(TopMemoryContext,
! 										   "Wal Sender",
! 										   ALLOCSET_DEFAULT_MINSIZE,
! 										   ALLOCSET_DEFAULT_INITSIZE,
! 										   ALLOCSET_DEFAULT_MAXSIZE);
! 	MemoryContextSwitchTo(walsnd_context);
  
  	/* Set up resource owner */
  	CurrentResourceOwner = ResourceOwnerCreate(NULL, "walsender top-level resource owner");
  
- 	/* Unblock signals (they were blocked when the postmaster forked us) */
- 	PG_SETMASK(&UnBlockSig);
- 
  	/*
  	 * Use the recovery target timeline ID during recovery
  	 */
  	if (am_cascading_walsender)
  		ThisTimeLineID = GetRecoveryTargetTLI();
- 
- 	/* Tell the standby that walsender is ready for receiving commands */
- 	ReadyForQuery(DestRemote);
- 
- 	/* Handle handshake messages before streaming */
- 	WalSndHandshake();
- 
- 	/* Initialize shared memory status */
- 	{
- 		/* use volatile pointer to prevent code rearrangement */
- 		volatile WalSnd *walsnd = MyWalSnd;
- 
- 		SpinLockAcquire(&walsnd->mutex);
- 		walsnd->sentPtr = sentPtr;
- 		SpinLockRelease(&walsnd->mutex);
- 	}
- 
- 	SyncRepInitConfig();
- 
- 	/* Main loop of walsender */
- 	WalSndLoop();
  }
  
  /*
!  * Execute commands from walreceiver, until we enter streaming mode.
   */
! static void
! WalSndHandshake(void)
  {
! 	StringInfoData input_message;
! 	bool		replication_started = false;
! 
! 	initStringInfo(&input_message);
! 
! 	while (!replication_started)
  	{
! 		int			firstchar;
! 
! 		WalSndSetState(WALSNDSTATE_STARTUP);
! 		set_ps_display("idle", false);
! 
! 		/* Wait for a command to arrive */
! 		firstchar = pq_getbyte();
! 
! 		/*
! 		 * Emergency bailout if postmaster has died.  This is to avoid the
! 		 * necessity for manual cleanup of all postmaster children.
! 		 */
! 		if (!PostmasterIsAlive())
! 			exit(1);
! 
! 		/*
! 		 * Check for any other interesting events that happened while we
! 		 * slept.
! 		 */
! 		if (got_SIGHUP)
! 		{
! 			got_SIGHUP = false;
! 			ProcessConfigFile(PGC_SIGHUP);
! 		}
! 
! 		if (firstchar != EOF)
! 		{
! 			/*
! 			 * Read the message contents. This is expected to be done without
! 			 * blocking because we've been able to get message type code.
! 			 */
! 			if (pq_getmessage(&input_message, 0))
! 				firstchar = EOF;	/* suitable message already logged */
! 		}
! 
! 		/* Handle the very limited subset of commands expected in this phase */
! 		switch (firstchar)
! 		{
! 			case 'Q':			/* Query message */
! 				{
! 					const char *query_string;
! 
! 					query_string = pq_getmsgstring(&input_message);
! 					pq_getmsgend(&input_message);
! 
! 					if (HandleReplicationCommand(query_string))
! 						replication_started = true;
! 				}
! 				break;
! 
! 			case 'X':
! 				/* standby is closing the connection */
! 				proc_exit(0);
! 
! 			case EOF:
! 				/* standby disconnected unexpectedly */
! 				ereport(COMMERROR,
! 						(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 						 errmsg("unexpected EOF on standby connection")));
! 				proc_exit(0);
! 
! 			default:
! 				ereport(FATAL,
! 						(errcode(ERRCODE_PROTOCOL_VIOLATION),
! 						 errmsg("invalid standby handshake message type %d", firstchar)));
! 		}
  	}
  }
  
--- 134,172 ----
  static void WalSndKeepalive(char *msgbuf);
  
  
! /* Initialize walsender process before entering the main command loop */
  void
! InitWalSender(void)
  {
  	am_cascading_walsender = RecoveryInProgress();
  
  	/* Create a per-walsender data structure in shared memory */
! 	InitWalSenderSlot();
  
  	/* Set up resource owner */
  	CurrentResourceOwner = ResourceOwnerCreate(NULL, "walsender top-level resource owner");
  
  	/*
  	 * Use the recovery target timeline ID during recovery
  	 */
  	if (am_cascading_walsender)
  		ThisTimeLineID = GetRecoveryTargetTLI();
  }
  
  /*
!  * Clean up after an error.
!  *
!  * WAL sender processes don't use transactions like regular backends do.
!  * This should do any cleanup required in a WAL sender process, similar to
!  * what transaction abort does in a regular backend.
   */
! void
! WalSndErrorCleanup()
  {
! 	if (sendFile >= 0)
  	{
! 		close(sendFile);
! 		sendFile = -1;
  	}
  }
  
***************
*** 350,364 **** IdentifySystem(void)
  	pq_sendbytes(&buf, (char *) xpos, strlen(xpos));
  
  	pq_endmessage(&buf);
- 
- 	/* Send CommandComplete and ReadyForQuery messages */
- 	EndCommand("SELECT", DestRemote);
- 	ReadyForQuery(DestRemote);
- 	/* ReadyForQuery did pq_flush for us */
  }
  
  /*
!  * START_REPLICATION
   */
  static void
  StartReplication(StartReplicationCmd *cmd)
--- 238,250 ----
  	pq_sendbytes(&buf, (char *) xpos, strlen(xpos));
  
  	pq_endmessage(&buf);
  }
  
  /*
!  * Handle START_REPLICATION command.
!  *
!  * At the moment, this never returns, but an ereport(ERROR) will take us back
!  * to the main loop.
   */
  static void
  StartReplication(StartReplicationCmd *cmd)
***************
*** 435,449 **** StartReplication(StartReplicationCmd *cmd)
  	 * be shipped from that position
  	 */
  	sentPtr = cmd->startpoint;
  }
  
  /*
   * Execute an incoming replication command.
   */
! static bool
! HandleReplicationCommand(const char *cmd_string)
  {
- 	bool		replication_started = false;
  	int			parse_rc;
  	Node	   *cmd_node;
  	MemoryContext cmd_context;
--- 321,349 ----
  	 * be shipped from that position
  	 */
  	sentPtr = cmd->startpoint;
+ 
+ 	/* Also update the start position status in shared memory */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile WalSnd *walsnd = MyWalSnd;
+ 
+ 		SpinLockAcquire(&walsnd->mutex);
+ 		walsnd->sentPtr = sentPtr;
+ 		SpinLockRelease(&walsnd->mutex);
+ 	}
+ 
+ 	SyncRepInitConfig();
+ 
+ 	/* Main loop of walsender */
+ 	WalSndLoop();
  }
  
  /*
   * Execute an incoming replication command.
   */
! void
! exec_replication_command(const char *cmd_string)
  {
  	int			parse_rc;
  	Node	   *cmd_node;
  	MemoryContext cmd_context;
***************
*** 451,456 **** HandleReplicationCommand(const char *cmd_string)
--- 351,358 ----
  
  	elog(DEBUG1, "received replication command: %s", cmd_string);
  
+ 	CHECK_FOR_INTERRUPTS();
+ 
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
  										"Replication command context",
  										ALLOCSET_DEFAULT_MINSIZE,
***************
*** 476,493 **** HandleReplicationCommand(const char *cmd_string)
  
  		case T_StartReplicationCmd:
  			StartReplication((StartReplicationCmd *) cmd_node);
- 
- 			/* break out of the loop */
- 			replication_started = true;
  			break;
  
  		case T_BaseBackupCmd:
  			SendBaseBackup((BaseBackupCmd *) cmd_node);
- 
- 			/* Send CommandComplete and ReadyForQuery messages */
- 			EndCommand("SELECT", DestRemote);
- 			ReadyForQuery(DestRemote);
- 			/* ReadyForQuery did pq_flush for us */
  			break;
  
  		default:
--- 378,387 ----
***************
*** 500,506 **** HandleReplicationCommand(const char *cmd_string)
  	MemoryContextSwitchTo(old_context);
  	MemoryContextDelete(cmd_context);
  
! 	return replication_started;
  }
  
  /*
--- 394,401 ----
  	MemoryContextSwitchTo(old_context);
  	MemoryContextDelete(cmd_context);
  
! 	/* Send CommandComplete message */
! 	EndCommand("SELECT", DestRemote);
  }
  
  /*
***************
*** 754,768 **** WalSndLoop(void)
  			SyncRepInitConfig();
  		}
  
! 		/* Normal exit from the walsender is here */
! 		if (walsender_shutdown_requested)
! 		{
! 			/* Inform the standby that XLOG streaming is done */
! 			pq_puttextmessage('C', "COPY 0");
! 			pq_flush();
! 
! 			proc_exit(0);
! 		}
  
  		/* Check for input from the client */
  		ProcessRepliesIfAny();
--- 649,655 ----
  			SyncRepInitConfig();
  		}
  
! 		CHECK_FOR_INTERRUPTS();
  
  		/* Check for input from the client */
  		ProcessRepliesIfAny();
***************
*** 813,819 **** WalSndLoop(void)
  				XLogSend(output_message, &caughtup);
  				if (caughtup && !pq_is_send_pending())
  				{
! 					walsender_shutdown_requested = true;
  					continue;	/* don't want to wait more */
  				}
  			}
--- 700,706 ----
  				XLogSend(output_message, &caughtup);
  				if (caughtup && !pq_is_send_pending())
  				{
! 					ProcDiePending = true;
  					continue;	/* don't want to wait more */
  				}
  			}
***************
*** 854,861 **** WalSndLoop(void)
--- 741,751 ----
  			}
  
  			/* Sleep until something happens or replication timeout */
+ 			ImmediateInterruptOK = true;
+ 			CHECK_FOR_INTERRUPTS();
  			WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
  							  MyProcPort->sock, sleeptime);
+ 			ImmediateInterruptOK = false;
  
  			/*
  			 * Check for replication timeout.  Note we ignore the corner case
***************
*** 892,898 **** WalSndLoop(void)
  
  /* Initialize a per-walsender data structure for this walsender process */
  static void
! InitWalSnd(void)
  {
  	int			i;
  
--- 782,788 ----
  
  /* Initialize a per-walsender data structure for this walsender process */
  static void
! InitWalSenderSlot(void)
  {
  	int			i;
  
***************
*** 1284,1341 **** WalSndSigHupHandler(SIGNAL_ARGS)
  	errno = save_errno;
  }
  
- /* SIGTERM: set flag to shut down */
- static void
- WalSndShutdownHandler(SIGNAL_ARGS)
- {
- 	int			save_errno = errno;
- 
- 	walsender_shutdown_requested = true;
- 	if (MyWalSnd)
- 		SetLatch(&MyWalSnd->latch);
- 
- 	/*
- 	 * Set the standard (non-walsender) state as well, so that we can abort
- 	 * things like do_pg_stop_backup().
- 	 */
- 	InterruptPending = true;
- 	ProcDiePending = true;
- 
- 	errno = save_errno;
- }
- 
- /*
-  * WalSndQuickDieHandler() occurs when signalled SIGQUIT by the postmaster.
-  *
-  * Some backend has bought the farm,
-  * so we need to stop what we're doing and exit.
-  */
- static void
- WalSndQuickDieHandler(SIGNAL_ARGS)
- {
- 	PG_SETMASK(&BlockSig);
- 
- 	/*
- 	 * We DO NOT want to run proc_exit() callbacks -- we're here because
- 	 * shared memory may be corrupted, so we don't want to try to clean up our
- 	 * transaction.  Just nail the windows shut and get out of town.  Now that
- 	 * there's an atexit callback to prevent third-party code from breaking
- 	 * things by calling exit() directly, we have to reset the callbacks
- 	 * explicitly to make this work as intended.
- 	 */
- 	on_exit_reset();
- 
- 	/*
- 	 * Note we do exit(2) not exit(0).	This is to force the postmaster into a
- 	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
- 	 * backend.  This is necessary precisely because we don't clean up our
- 	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
- 	 * should ensure the postmaster sees this as a crash, too, but no harm in
- 	 * being doubly sure.)
- 	 */
- 	exit(2);
- }
- 
  /* SIGUSR1: set flag to send WAL records */
  static void
  WalSndXLogSendHandler(SIGNAL_ARGS)
--- 1174,1179 ----
***************
*** 1368,1375 **** WalSndSignals(void)
  	pqsignal(SIGHUP, WalSndSigHupHandler);		/* set flag to read config
  												 * file */
  	pqsignal(SIGINT, SIG_IGN);	/* not used */
! 	pqsignal(SIGTERM, WalSndShutdownHandler);	/* request shutdown */
! 	pqsignal(SIGQUIT, WalSndQuickDieHandler);	/* hard crash time */
  	InitializeTimeouts();		/* establishes SIGALRM handler */
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
--- 1206,1213 ----
  	pqsignal(SIGHUP, WalSndSigHupHandler);		/* set flag to read config
  												 * file */
  	pqsignal(SIGINT, SIG_IGN);	/* not used */
! 	pqsignal(SIGTERM, die);						/* request shutdown */
! 	pqsignal(SIGQUIT, quickdie);				/* hard crash time */
  	InitializeTimeouts();		/* establishes SIGALRM handler */
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 192,197 **** static int	InteractiveBackend(StringInfo inBuf);
--- 192,198 ----
  static int	interactive_getc(void);
  static int	SocketBackend(StringInfo inBuf);
  static int	ReadCommand(StringInfo inBuf);
+ static void forbidden_in_wal_sender(char firstchar);
  static List *pg_rewrite_query(Query *query);
  static bool check_log_statement(List *stmt_list);
  static int	errdetail_execute(List *raw_parsetree_list);
***************
*** 3720,3731 **** PostgresMain(int argc, char *argv[], const char *username)
  	if (IsUnderPostmaster && Log_disconnections)
  		on_proc_exit(log_disconnections, 0);
  
! 	/* If this is a WAL sender process, we're done with initialization. */
  	if (am_walsender)
! 	{
! 		WalSenderMain();		/* does not return */
! 		abort();
! 	}
  
  	/*
  	 * process any libraries that should be preloaded at backend start (this
--- 3721,3729 ----
  	if (IsUnderPostmaster && Log_disconnections)
  		on_proc_exit(log_disconnections, 0);
  
! 	/* Perform initialization specific to a WAL sender process. */
  	if (am_walsender)
! 		InitWalSender();
  
  	/*
  	 * process any libraries that should be preloaded at backend start (this
***************
*** 3835,3840 **** PostgresMain(int argc, char *argv[], const char *username)
--- 3833,3841 ----
  		 */
  		AbortCurrentTransaction();
  
+ 		if (am_walsender)
+ 			WalSndErrorCleanup();
+ 
  		/*
  		 * Now return to normal top-level context and clear ErrorContext for
  		 * next time.
***************
*** 3969,3975 **** PostgresMain(int argc, char *argv[], const char *username)
  					query_string = pq_getmsgstring(&input_message);
  					pq_getmsgend(&input_message);
  
! 					exec_simple_query(query_string);
  
  					send_ready_for_query = true;
  				}
--- 3970,3979 ----
  					query_string = pq_getmsgstring(&input_message);
  					pq_getmsgend(&input_message);
  
! 					if (am_walsender)
! 						exec_replication_command(query_string);
! 					else
! 						exec_simple_query(query_string);
  
  					send_ready_for_query = true;
  				}
***************
*** 3982,3987 **** PostgresMain(int argc, char *argv[], const char *username)
--- 3986,3993 ----
  					int			numParams;
  					Oid		   *paramTypes = NULL;
  
+ 					forbidden_in_wal_sender(firstchar);
+ 
  					/* Set statement_timestamp() */
  					SetCurrentStatementStartTimestamp();
  
***************
*** 4004,4009 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4010,4017 ----
  				break;
  
  			case 'B':			/* bind */
+ 				forbidden_in_wal_sender(firstchar);
+ 
  				/* Set statement_timestamp() */
  				SetCurrentStatementStartTimestamp();
  
***************
*** 4019,4024 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4027,4034 ----
  					const char *portal_name;
  					int			max_rows;
  
+ 					forbidden_in_wal_sender(firstchar);
+ 
  					/* Set statement_timestamp() */
  					SetCurrentStatementStartTimestamp();
  
***************
*** 4031,4036 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4041,4048 ----
  				break;
  
  			case 'F':			/* fastpath function call */
+ 				forbidden_in_wal_sender(firstchar);
+ 
  				/* Set statement_timestamp() */
  				SetCurrentStatementStartTimestamp();
  
***************
*** 4078,4083 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4090,4097 ----
  					int			close_type;
  					const char *close_target;
  
+ 					forbidden_in_wal_sender(firstchar);
+ 
  					close_type = pq_getmsgbyte(&input_message);
  					close_target = pq_getmsgstring(&input_message);
  					pq_getmsgend(&input_message);
***************
*** 4120,4125 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4134,4141 ----
  					int			describe_type;
  					const char *describe_target;
  
+ 					forbidden_in_wal_sender(firstchar);
+ 
  					/* Set statement_timestamp() (needed for xact) */
  					SetCurrentStatementStartTimestamp();
  
***************
*** 4201,4206 **** PostgresMain(int argc, char *argv[], const char *username)
--- 4217,4245 ----
  	}							/* end of input-reading loop */
  }
  
+ /*
+  * Throw an error if we're a WAL sender process.
+  *
+  * This is used to forbid anything else than simple query protocol messages
+  * in a WAL sender process. 'firstchar' specifies what kind of a forbidden
+  * message was received, and is used to construct the error message.
+  */
+ static void
+ forbidden_in_wal_sender(char firstchar)
+ {
+ 	if (am_walsender)
+ 	{
+ 		if (firstchar == 'F')
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 					 errmsg("fastpath function calls not supported in a replication connection")));
+ 		else
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 					 errmsg("extended query protocol not supported in a replication connection")));
+ 	}
+ }
+ 
  
  /*
   * Obtain platform stack depth limit (in bytes)
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 19,25 ****
  /* global state */
  extern bool am_walsender;
  extern bool am_cascading_walsender;
- extern volatile sig_atomic_t walsender_shutdown_requested;
  extern volatile sig_atomic_t walsender_ready_to_stop;
  extern bool wake_wal_senders;
  
--- 19,24 ----
***************
*** 27,33 **** extern bool wake_wal_senders;
  extern int	max_wal_senders;
  extern int	replication_timeout;
  
! extern void WalSenderMain(void) __attribute__((noreturn));
  extern void WalSndSignals(void);
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
--- 26,34 ----
  extern int	max_wal_senders;
  extern int	replication_timeout;
  
! extern void InitWalSender(void);
! extern void exec_replication_command(const char *query_string);
! extern void WalSndErrorCleanup(void);
  extern void WalSndSignals(void);
  extern Size WalSndShmemSize(void);
  extern void WalSndShmemInit(void);
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#26)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

So I propose the attached patch. I made small changes to postgres.c to
make it call exec_replication_command() instead of exec_simple_query(),
and reject the extended query protocol, in a WAL sender process. A lot of code
related to handling the main command loop and signals is removed from
walsender.c.

Why do we need the forbidden_in_wal_sender stuff? If we're going in
this direction, I suggest there is little reason to restrict what the
replication client can do. This seems to be both ugly and a drag on
the performance of normal backends.

regards, tom lane

#28Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#27)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

On 04.10.2012 19:00, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

So I propose the attached patch. I made small changes to postgres.c to
make it call exec_replication_command() instead of exec_simple_query(),
and reject the extended query protocol, in a WAL sender process. A lot of code
related to handling the main command loop and signals is removed from
walsender.c.

Why do we need the forbidden_in_wal_sender stuff? If we're going in
this direction, I suggest there is little reason to restrict what the
replication client can do. This seems to be both ugly and a drag on
the performance of normal backends.

Well, there's not much need for parameterized queries or cursors with
the replication command set at the moment. I don't think it's worth it
to try to support them. Fastpath function calls make no sense either, as
you can't call user-defined functions in a walsender anyway.
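
As an illustration of the restriction being discussed, here is a minimal standalone sketch (not the patch's code) of gating frontend message types in a replication connection. The function and enum names are hypothetical; the type bytes are the protocol's usual ones ('Q' simple query, 'P' parse, 'B' bind, 'E' execute, 'C' close, 'D' describe, 'F' fastpath function call).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical verdicts for an incoming frontend message. */
typedef enum
{
	MSG_ALLOWED,			/* message may be processed */
	MSG_NO_FASTPATH,		/* fastpath function call rejected */
	MSG_NO_EXTENDED_QUERY	/* extended query protocol message rejected */
} MsgVerdict;

/*
 * Decide whether a message type is acceptable on this connection.
 * Regular backends accept everything; a walsender accepts only messages
 * outside the extended query and fastpath sets.
 */
MsgVerdict
check_message_type(bool is_walsender, int firstchar)
{
	/* The caller is assumed to have handled EOF before dispatching. */
	assert(firstchar != EOF);

	if (!is_walsender)
		return MSG_ALLOWED;

	switch (firstchar)
	{
		case 'F':
			return MSG_NO_FASTPATH;
		case 'P':
		case 'B':
		case 'E':
		case 'C':
		case 'D':
			return MSG_NO_EXTENDED_QUERY;
		default:
			return MSG_ALLOWED;
	}
}
```

In this sketch a 'Q' message on a replication connection is allowed through, which corresponds to the point where the patch hands the query string to exec_replication_command().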

Perhaps we could make walsenders even more like regular backends than
what I was proposing, so that the replication commands are parsed and
executed just like regular utility commands. However, that'd require
some transaction support in walsender, for starters, which seems messy.
It might become sensible in the future if the replication command set
gets even more complicated, but it doesn't seem like a good idea at the
moment.

- Heikki

#29Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: Promoting a standby during base backup (was Re: Switching timeline over streaming replication)

On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 03.10.2012 18:15, Amit Kapila wrote:

On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Hmm, should a base backup be aborted when the standby is promoted? Does
the promotion render the backup corrupt?

I think currently it does so. Pls refer
1.
do_pg_stop_backup(char *labelfile, bool waitforarchive)
{
..
    if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("the standby was promoted during online backup"),
                 errhint("This means that the backup being taken is corrupt "
                         "and should not be used. "
                         "Try taking another online backup.")));
..
}

Okay. I think that check in do_pg_stop_backup() actually already ensures
that you don't end up with a corrupt backup, even if the standby is promoted
while a backup is being taken. Admittedly it would be nicer to abort it
immediately rather than error out at the end.

But I wonder why promoting a standby renders the backup invalid in the first
place? Fujii, Simon, can you explain that?

Simon had the same question and I answered it before.

http://archives.postgresql.org/message-id/CAHGQGwFU04oO8YL5SUcdjVq3BRNi7WtfzTy9wA2kXtZNHicTeA@mail.gmail.com
---------------------------------------

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

If the backup succeeds in that case, when we start an archive recovery from that
backup, the recovery needs to cross between two timelines. Which means that
we need to set recovery_target_timeline before starting recovery. Whether
recovery_target_timeline needs to be set or not depends on whether the standby
was promoted during taking the backup. Leaving such a decision to a user seems
fragile.

pg_basebackup -x ensures that all required files are included in the backup and
we can start recovery without restoring any file from the archive. But
if the standby is promoted during the backup, the timeline history file
would become an essential file for recovery, but it's not included in the
backup.
---------------------------------------
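
To make the missing-file point concrete: a timeline history file is named after the timeline ID, written as eight hex digits followed by ".history" (for example 00000002.history). The sketch below is a standalone illustration only. Both helper names are hypothetical, and the rule that a backup which starts and ends on different timelines needs the newer timeline's history file is a simplification of the argument above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write the history file name for timeline 'tli' into 'buf'.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int
timeline_history_file_name(char *buf, size_t buflen, uint32_t tli)
{
	int			n;

	assert(buf != NULL);

	n = snprintf(buf, buflen, "%08X.history", (unsigned int) tli);
	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}

/*
 * Hypothetical check: a backup that began on one timeline and ended on
 * another can only be recovered with the history file of the end timeline.
 */
bool
backup_needs_history_file(uint32_t start_tli, uint32_t end_tli)
{
	return start_tli != end_tli;
}
```

A backup tool could use a check like this to decide whether to add the history file to the backup, or to fail early instead of producing a backup that cannot be recovered on its own.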

The situation may change if your switching-timeline patch has been committed.
It's useful if we can continue the backup even if the standby is promoted.

Regards,

--
Fujii Masao

#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#28)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

On 4 October 2012 17:23, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 04.10.2012 19:00, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

So I propose the attached patch. I made small changes to postgres.c to
make it call exec_replication_command() instead of exec_simple_query(),
and reject the extended query protocol, in a WAL sender process. A lot of code
related to handling the main command loop and signals is removed from
walsender.c.

Why do we need the forbidden_in_wal_sender stuff? If we're going in
this direction, I suggest there is little reason to restrict what the
replication client can do. This seems to be both ugly and a drag on
the performance of normal backends.

Well, there's not much need for parameterized queries or cursors with the
replication command set at the moment. I don't think it's worth it to try to
support them. Fastpath function calls make no sense either, as you can't
call user-defined functions in a walsender anyway.

Perhaps we could make walsenders even more like regular backends than what I
was proposing, so that the replication commands are parsed and executed just
like regular utility commands. However, that'd require some transaction
support in walsender, for starters, which seems messy. It might become
sensible in the future if the replication command set gets even more
complicated, but it doesn't seem like a good idea at the moment.

It's come up a few times now that people want to run a few queries
either before or after running a base backup.

Since the pg_basebackup stuff uses walsender, this makes such things impossible.

So to support that, we need to allow two kinds of connection, one to
"replication" and one to something else, and since the something else
is not guaranteed to exist that makes it even harder.

Andres suggested to me the other day we make walsender more like
regular backends. At the time I wasn't sure I agreed, but reading this
it looks like a sensible way to go.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#30)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

Simon Riggs <simon@2ndQuadrant.com> writes:

On 4 October 2012 17:23, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Perhaps we could make walsenders even more like regular backends than what I
was proposing, so that the replication commands are parsed and executed just
like regular utility commands. However, that'd require some transaction
support in walsender, for starters, which seems messy. It might become
sensible in the future if the replication command set gets even more
complicated, but it doesn't seem like a good idea at the moment.

It's come up a few times now that people want to run a few queries
either before or after running a base backup. ...
Andres suggested to me the other day we make walsender more like
regular backends. At the time I wasn't sure I agreed, but reading this
it looks like a sensible way to go.

That was what I was thinking too, but on reflection there's at least one
huge problem: how could we run queries without being connected to a
specific database? Which walsender isn't.

regards, tom lane

#32Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#31)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

On Thursday, October 04, 2012 10:58:53 PM Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 4 October 2012 17:23, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Perhaps we could make walsenders even more like regular backends than
what I was proposing, so that the replication commands are parsed and
executed just like regular utility commands. However, that'd require
some transaction support in walsender, for starters, which seems messy.
It might become sensible in the future if the replication command set
gets even more complicated, but it doesn't seem like a good idea at the
moment.

It's come up a few times now that people want to run a few queries
either before or after running a base backup. ...
Andres suggested to me the other day we make walsender more like
regular backends. At the time I wasn't sure I agreed, but reading this
it looks like a sensible way to go.

I only went that way after you've disliked my other suggestions ;)

That was what I was thinking too, but on reflection there's at least one
huge problem: how could we run queries without being connected to a
specific database? Which walsender isn't.

I had quite some problems with that too. For now I've hacked walsender to
connect to the database specified in the connection, not sure whether that's the
way to go. Seems to work so far.

I wanted to start a thread about this anyway, but as it came up here...

The reason "we" (as in logical rep) need a database in the current approach is
that we need to access the catalog (in a timetraveling fashion) to know how to
decode the data in the wal... The patch I sent two weeks ago does the decoding
from inside a normal backend but that was just because I couldn't make
walsender work inside a database in time...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#33Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#26)
Re: Sharing more infrastructure between walsenders and regular backends (was Re: Switching timeline over streaming replication)

On Thursday, October 04, 2012 8:40 PM Heikki Linnakangas wrote:

On 03.10.2012 18:15, Amit Kapila wrote:

35. WalSenderMain(void)
{
..
+		if (walsender_shutdown_requested)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating replication connection due to administrator command")));
+
+		/* Tell the client that we are ready to receive commands */
+		ReadyForQuery(DestRemote);
+
..
+		if (walsender_shutdown_requested)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating replication connection due to administrator command")));
+

Is it necessary to check walsender_shutdown_requested twice in the
loop? If so, can we add a comment explaining why it is important to
check it again?

The idea was to check for shutdown request before and after the
pq_getbyte() call, because that can block for a long time.
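
The "check before and after a blocking call" idea can be sketched in plain C as follows. This is a standalone illustration under stated assumptions, not walsender code: all names are hypothetical, the blocking read is passed in as a function pointer, and one stand-in reader simulates SIGTERM arriving during the wait by calling raise().

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Set by the SIGTERM handler; checked around the blocking read. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
	(void) signo;
	shutdown_requested = 1;
}

void
install_shutdown_handler(void)
{
	signal(SIGTERM, handle_sigterm);
}

typedef int (*blocking_read_fn) (void);

/*
 * Wait for one command byte. Returns false if a shutdown was requested
 * either before the wait started or while the read was blocked.
 */
bool
wait_for_command(blocking_read_fn read_byte, int *firstchar)
{
	assert(read_byte != NULL && firstchar != NULL);

	if (shutdown_requested)		/* check before blocking */
		return false;

	*firstchar = read_byte();	/* may block for a long time */

	if (shutdown_requested)		/* check again after waking up */
		return false;

	return true;
}

/* Stand-in readers for demonstration only. */
int
read_query_byte(void)
{
	return 'Q';
}

int
read_interrupted_by_sigterm(void)
{
	raise(SIGTERM);				/* simulate SIGTERM during the wait */
	return 'Q';
}
```

Note that two checks alone do not make the process react while it is still blocked in the read. That gap is what the later messages in this thread address by reusing the regular backend's interrupt handling.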

Looking closer, we don't currently (ie. without this patch) make any
effort to react to SIGTERM in a walsender, while it's waiting for a
command from the client. After starting replication, it does check
walsender_shutdown_requested in the loop, and it's also checked during a
base backup (although only when switching to send next file, which seems
too seldom). This issue is orthogonal to handling timeline changes over
streaming replication, although that patch will make it more important
to handle SIGTERM quickly while waiting for a command, because you stay
in that mode for longer and more often.

I think walsender needs to share more infrastructure with regular
backends to handle this better. When we first implemented streaming
replication in 9.0, it made sense to implement just the bare minimum
needed to accept the handshake commands before entering the Copy state,
but now that the replication command set has grown to cover base
backups, and fetching timelines with the patch being discussed, we
should bite the bullet and make the command loop more feature-complete
and robust.

Certainly there are benefits to making the walsender and backend infrastructure
similar, as mentioned by Simon, Andres and Tom.
However, shouldn't we do this as a separate feature, along with some other
requirements, and let the switch-timeline patch get committed first,
as this is an altogether different feature?

With Regards,
Amit Kapila.

#34Amit Kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#25)
Re: Switching timeline over streaming replication

On Thursday, October 04, 2012 7:22 PM Heikki Linnakangas wrote:

On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote:
On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Thanks for the thorough review! I committed the xlog.c refactoring patch
now. Attached is a new version of the main patch, comments on specific
points below. I didn't adjust the docs per your comments yet, will do
that next.

I have some doubts regarding the comments fixed by you and some more new
review comments.
After this I shall focus mainly on testing this patch.

Testing
-----------

One more test seems to have failed. Apart from this, the other tests passed.

2. a. Master M-1
b. Standby S-1 follows M-1
c. insert 10 records on M-1. verify all records are visible on M-1,S-1
d. Stop S-1
e. insert 2 records on M-1.
f. Stop M-1
g. Start S-1
h. Promote S-1
i. Make M-1 recovery.conf such that it should connect to S-1
j. Start M-1. Below error comes on M-1 which is expected as M-1 has more
data.
LOG: database system was shut down at 2012-10-05 16:45:39 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176A070
LOG: record with zero length at 0/176A070
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: fetching timeline history file for timeline 2 from primary
server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
k. Stop M-1. Start M-1. It is able to successfully connect to S-1 which
is a problem.
l. check in S-1. Records inserted in step-e are not present.
m. Now insert records in S-1. M-1 doesn't receive any records. On M-1
server following log is getting printed.
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
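
For readers decoding the log above: a WAL segment file name is 24 hex digits, and the first eight are the timeline ID, so 000000020000000000000001 is a segment on timeline 2. The following standalone sketch splits such a name into its three fields. The function name is hypothetical and this is not the server's own parsing code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Split a 24-character WAL segment file name into timeline ID, log and
 * segment numbers. Returns false if the name is not 24 uppercase hex digits.
 */
bool
parse_wal_segment_name(const char *fname,
					   uint32_t *tli, uint32_t *log, uint32_t *seg)
{
	unsigned int t,
				l,
				s;

	assert(fname != NULL && tli != NULL && log != NULL && seg != NULL);

	if (strlen(fname) != 24 ||
		strspn(fname, "0123456789ABCDEF") != 24)
		return false;

	if (sscanf(fname, "%8X%8X%8X", &t, &l, &s) != 3)
		return false;

	*tli = t;
	*log = l;
	*seg = s;
	return true;
}
```

Applied to the segment in the log, this yields timeline 2, log 0, segment 1, while the message reports timeline ID 1 being seen after 2 within that segment.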

With Regards,
Amit Kapila.

#35Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#29)
Re: Promoting a standby during base backup (was Re: Switching timeline over streaming replication)

On 04.10.2012 20:07, Fujii Masao wrote:

On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas

But I wonder why promoting a standby renders the backup invalid in the first
place? Fujii, Simon, can you explain that?

Simon had the same question and I answered it before.

http://archives.postgresql.org/message-id/CAHGQGwFU04oO8YL5SUcdjVq3BRNi7WtfzTy9wA2kXtZNHicTeA@mail.gmail.com
---------------------------------------

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

If the backup succeeds in that case, when we start an archive recovery from that
backup, the recovery needs to cross between two timelines. Which means that
we need to set recovery_target_timeline before starting recovery. Whether
recovery_target_timeline needs to be set or not depends on whether the standby
was promoted during taking the backup. Leaving such a decision to a user seems
fragile.

pg_control is backed up last, it would contain the new timeline. No need
to set recovery_target_timeline.

pg_basebackup -x ensures that all required files are included in the backup and
we can start recovery without restoring any file from the archive. But
if the standby is promoted during the backup, the timeline history
file would become
an essential file for recovery, but it's not included in the backup.

That is true. We could teach it to include the timeline history file,
though.

The situation may change if your switching-timeline patch has been committed.
It's useful if we can continue the backup even if the standby is promoted.

It wouldn't help with pg_basebackup -x, although it would allow
streaming replication to fetch the timeline history file.

I guess it's best to keep that restriction for now. But I'll add a TODO
item for this.

- Heikki

#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#29)
Re: Promoting a standby during base backup (was Re: Switching timeline over streaming replication)

On 4 October 2012 18:07, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 4, 2012 at 4:59 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 03.10.2012 18:15, Amit Kapila wrote:

On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:

Hmm, should a base backup be aborted when the standby is promoted? Does
the promotion render the backup corrupt?

I think currently it does so. Pls refer
1.
do_pg_stop_backup(char *labelfile, bool waitforarchive)
{
..
    if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("the standby was promoted during online backup"),
                 errhint("This means that the backup being taken is corrupt "
                         "and should not be used. "
                         "Try taking another online backup.")));
..
}

Okay. I think that check in do_pg_stop_backup() actually already ensures
that you don't end up with a corrupt backup, even if the standby is promoted
while a backup is being taken. Admittedly it would be nicer to abort it
immediately rather than error out at the end.

But I wonder why promoting a standby renders the backup invalid in the first
place? Fujii, Simon, can you explain that?

Simon had the same question and I answered it before.

http://archives.postgresql.org/message-id/CAHGQGwFU04oO8YL5SUcdjVq3BRNi7WtfzTy9wA2kXtZNHicTeA@mail.gmail.com
---------------------------------------

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

If the backup succeeds in that case, when we start an archive recovery from that
backup, the recovery needs to cross between two timelines. Which means that
we need to set recovery_target_timeline before starting recovery. Whether
recovery_target_timeline needs to be set or not depends on whether the standby
was promoted during taking the backup. Leaving such a decision to a user seems
fragile.

I accepted your answer before, but I think it should be challenged
now. This is definitely a time when you really want that backup, so
invalidating it for such a weak reason is not useful, even if I
understand your original thought.

Something that has concerned me is that we don't have an explicit
"timeline change record". We *say* we do that at shutdown checkpoints,
but that is recorded in the new timeline. So we have the strange
situation of changing timeline at two separate places.

When we change timeline we really should generate one last WAL on the
old timeline that marks an explicit change of timeline and a single
exact moment when the timeline change takes place. With PITR we are
unable to do that, because any timeline can fork at any point. With
smooth switchover we have a special case that is not "anything goes"
and there is a good case for not incrementing the timeline at all.

This is still a half-formed thought, but at least you should know I'm
in the debate.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#34)
1 attachment(s)
Re: Switching timeline over streaming replication

On 06.10.2012 15:58, Amit Kapila wrote:

One more test seems to be failed. Apart from this, other tests are passed.

2. a. Master M-1
b. Standby S-1 follows M-1
c. insert 10 records on M-1. verify all records are visible on M-1,S-1
d. Stop S-1
e. insert 2 records on M-1.
f. Stop M-1
g. Start S-1
h. Promote S-1
i. Make M-1 recovery.conf such that it should connect to S-1
j. Start M-1. Below error comes on M-1 which is expected as M-1 has more
data.
LOG: database system was shut down at 2012-10-05 16:45:39 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176A070
LOG: record with zero length at 0/176A070
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: fetching timeline history file for timeline 2 from primary
server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
k. Stop M-1. Start M-1. It is able to successfully connect to S-1 which
is a problem.
l. check in S-1. Records inserted in step-e are not present.
m. Now insert records in S-1. M-1 doesn't receive any records. On the M-1
server the following log is printed.
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0

Hmm, seems we need to keep track of which timeline we've used to recover
before. Before restart, the master correctly notices that timeline 2
forked off earlier in its history, so it cannot recover to that
timeline. But after restart the master begins recovery from the previous
checkpoint, and because timeline 2 forked off timeline 1 after the
checkpoint, it concludes that it can follow that timeline. It doesn't
realize that it had already recovered/flushed some WAL in timeline
1 after the fork-point.

Attached is a new version of the patch. I committed the refactoring of
XLogPageRead() already, as that was a readability improvement even
without this patch. All the reported issues should be fixed now,
although I will continue testing this tomorrow. I added various checks
that the correct timeline is followed during recovery.
minRecoveryPoint is now accompanied by a timeline ID, so that when we
restart recovery, we check that we recover back to minRecoveryPoint
along the same timeline as last time. Also, it now checks at beginning
of recovery that the checkpoint record comes from the correct timeline.
That fixes the problem that you reported above. I also adjusted the
error messages on timeline history problems to be more clear.
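
For illustration, the check described above can be sketched as follows. This is a minimal model, not the code from the patch: the HistoryEntry layout, the helper names tli_of_point() and can_follow_timeline(), and the fork position used in the usage note are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* One entry of a hypothetical in-memory timeline history, oldest first. */
typedef struct
{
    TimeLineID tli;     /* timeline ID */
    XLogRecPtr begin;   /* first WAL position that belongs to this timeline */
    XLogRecPtr end;     /* position where the next timeline forked off;
                         * 0 means the timeline is still open */
} HistoryEntry;

/*
 * Return the timeline that owns WAL position ptr according to the history,
 * or 0 if no entry covers it.
 */
TimeLineID
tli_of_point(const HistoryEntry *history, size_t n, XLogRecPtr ptr)
{
    assert(history != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        bool after_begin = ptr >= history[i].begin;
        bool before_end = (history[i].end == 0) || (ptr < history[i].end);

        if (after_begin && before_end)
            return history[i].tli;
    }
    return 0;
}

/*
 * A target timeline can be followed only if its history places the saved
 * minimum recovery point on the same timeline it was reached on last time.
 */
bool
can_follow_timeline(const HistoryEntry *history, size_t n,
                    XLogRecPtr minRecoveryPoint,
                    TimeLineID minRecoveryPointTLI)
{
    return tli_of_point(history, n, minRecoveryPoint) == minRecoveryPointTLI;
}
```

With a history in which timeline 2 forks off timeline 1 at a hypothetical position 0/176A000, a server whose minimum recovery point is 0/176A070 on timeline 1 would be refused, because that position belongs to timeline 2 in the target history.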

- Heikki

Attachments:

streaming-tli-switch-4.patch.gz (application/x-gzip)
#38Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#37)
Re: Switching timeline over streaming replication

On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:

On 06.10.2012 15:58, Amit Kapila wrote:

One more test seems to be failed. Apart from this, other tests are passed.

It seems there is one more defect, please check the same
Defect:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.
FATAL: requested timeline 2 does not contain minimum recovery point
0/3000000 on timeline 1

Review:
The following statement is present in the high-availability.sgml file, and
it also needs to be modified in the patch.

Promoting a cascading standby terminates the immediate downstream
replication connections which it serves. This is because the timeline
becomes different between standbys, and they can no longer continue
replication. The affected standby(s) may reconnect to reestablish streaming
replication.

I feel some of the minor comments are still not handled:
35. +SendTimeLineHistory(TimeLineHistoryCmd *cmd) { ..
+ fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY, 0666);

error handling for fd < 0 is missing.

36. +SendTimeLineHistory(TimeLineHistoryCmd *cmd) { ..
    if (nread <= 0)
+       ereport(ERROR,
+               (errcode_for_file_access(),
+                errmsg("could not read file \"%s\": %m",
+                       path)));

FileClose should be done in error case as well.

With Regards,
Amit Kapila.

#39Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#1)
Re: Switching timeline over streaming replication

On 10 October 2012 15:26, Amit Kapila <amit.kapila@huawei.com> wrote:

On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:

On 06.10.2012 15:58, Amit Kapila wrote:

One more test seems to be failed. Apart from this, other tests are passed.

It seems there is one more defect, please check the same
Defect:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.
FATAL: requested timeline 2 does not contain minimum recovery point
0/3000000 on timeline 1

Hmm... I get something different. When I promote standby B, standby
C's log shows:

LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/4000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: walreceiver ended streaming and awaits new instructions
LOG: new target timeline is 2

Then when I stop then start standby C I get:

FATAL: timeline history was not contiguous
LOG: startup process (PID 22986) exited with exit code 1
LOG: aborting startup due to startup process failure

--
Thom

#40Amit Kapila
amit.kapila@huawei.com
In reply to: Amit Kapila (#38)
Re: Switching timeline over streaming replication

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Amit Kapila
Sent: Wednesday, October 10, 2012 7:57 PM
To: 'Heikki Linnakangas'
Cc: 'PostgreSQL-development'
Subject: Re: [HACKERS] Switching timeline over streaming replication

On Tuesday, October 09, 2012 10:32 PM Heikki Linnakangas wrote:

On 06.10.2012 15:58, Amit Kapila wrote:

One more test seems to be failed. Apart from this, other tests are passed.

It seems there is one more defect, please check the same
Defect:

The test is finished from my side.

one more issue:
Steps to reproduce the defect:

1. Do initdb
2. Set port=2303, wal_level=hot_standby, hot_standby=off, max_wal_senders=3
in the postgresql.conf file
3. Enable the replication connection in pg_hba.conf
4. Start the server.

Executing the following commands leads to failure.

./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

rm -fr ../../data_sub

./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

The following logs are observed in the server console.

ERROR: requested WAL segment 000000000000000000000002 has already been
removed
ERROR: requested WAL segment 000000000000000000000003 has already been
removed
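
As a reading aid for the messages above: a WAL segment file name is 24 hex digits, of which the first eight are the timeline ID and the remaining sixteen locate the segment. The helper below is a hypothetical sketch for decoding such a name, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Split a 24-character WAL segment file name into its three 8-digit hex
 * fields: the timeline ID, then the high and low parts of the segment
 * position.  Returns false if the name is not 24 uppercase hex digits.
 */
bool
parse_wal_segment_name(const char *fname,
                       uint32_t *tli, uint32_t *hi, uint32_t *lo)
{
    unsigned int t, h, l;

    assert(fname != NULL && tli != NULL && hi != NULL && lo != NULL);

    if (strlen(fname) != 24 || strspn(fname, "0123456789ABCDEF") != 24)
        return false;
    if (sscanf(fname, "%8X%8X%8X", &t, &h, &l) != 3)
        return false;

    *tli = t;
    *hi = h;
    *lo = l;
    return true;
}
```

Applied to 000000000000000000000002, this yields timeline 0 and segment 2. Timeline IDs start at 1, so the requested segment names themselves look suspect, independent of whether any file was actually removed.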

With Regards,
Amit Kapila.

#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#37)
Re: Switching timeline over streaming replication

Heikki Linnakangas wrote:

Attached is a new version of the patch. I committed the refactoring
of XLogPageRead() already, as that was a readability improvement
even without this patch. All the reported issues should be fixed
now, although I will continue testing this tomorrow. I added various
checks that the correct timeline is followed during recovery.
minRecoveryPoint is now accompanied by a timeline ID, so that when
we restart recovery, we check that we recover back to
minRecoveryPoint along the same timeline as last time. Also, it now
checks at beginning of recovery that the checkpoint record comes
from the correct timeline. That fixes the problem that you reported
above. I also adjusted the error messages on timeline history
problems to be more clear.

Heikki,

I see Amit found a problem with this patch. I assume you're going to
work a bit more on it and submit/commit another version. I'm marking
this one Returned with Feedback.

Thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#42Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#40)
1 attachment(s)
Re: Switching timeline over streaming replication

Here's an updated version of this patch, rebased with master, including
the recent replication timeout changes, and some other cleanup.

On 12.10.2012 09:34, Amit Kapila wrote:

The test is finished from myside.

one more issue:
...
./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

Fixed this.

However, the test scenario you point to here:
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com
still seems to be broken, although I get a different error message now.
I'll dig into this..

- Heikki

Attachments:

streaming-tli-switch-5.patch.gz (application/x-gzip)
#43Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#38)
Re: Switching timeline over streaming replication

On 10.10.2012 17:26, Amit Kapila wrote:

36. +SendTimeLineHistory(TimeLineHistoryCmd *cmd) { ..
    if (nread <= 0)
+       ereport(ERROR,
+               (errcode_for_file_access(),
+                errmsg("could not read file \"%s\": %m",
+                       path)));

FileClose should be done in error case as well.

Hmm, I think you're right. The straightforward fix to just call
FileClose() before the ereport()s in that function would not be enough,
though. You might run out of memory in pq_sendbytes(), for example,
which would throw an error. We could use PG_TRY/CATCH for this, but
seems like overkill. Perhaps the simplest fix is to use a global
(static) variable for the fd, and clean it up in WalSndErrorCleanup().
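
The static-variable approach can be sketched in plain C as below. This is only a model of the idea under stated assumptions: setjmp/longjmp stand in for the server's error handling, and error_cleanup(), raise_error(), send_file() and run_command() are hypothetical names, not the walsender's functions.

```c
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <stdbool.h>
#include <unistd.h>

static int history_fd = -1;     /* fd owned by the current command, -1 if none */
static jmp_buf error_context;   /* stand-in for the top-level error handler */

/* Stand-in for the error cleanup hook: close whatever is still open. */
void
error_cleanup(void)
{
    if (history_fd >= 0)
    {
        close(history_fd);
        history_fd = -1;
    }
}

/* Stand-in for ereport(ERROR): jump back to the top-level handler. */
void
raise_error(void)
{
    longjmp(error_context, 1);
}

/* True while the command still holds an open descriptor. */
bool
fd_is_open(void)
{
    return history_fd >= 0;
}

/*
 * Open a file and optionally fail midway, the way an out-of-memory error
 * inside a send routine might.  Only the success path closes the file here.
 */
void
send_file(const char *path, bool fail_midway)
{
    assert(path != NULL);

    history_fd = open(path, O_RDONLY);
    if (history_fd < 0)
        raise_error();
    if (fail_midway)
        raise_error();          /* the fd is still open at this point */

    close(history_fd);
    history_fd = -1;
}

/* Run one command; returns false if an error was raised along the way. */
bool
run_command(const char *path, bool fail_midway)
{
    if (setjmp(error_context) != 0)
    {
        error_cleanup();        /* releases the fd on every error path */
        return false;
    }
    send_file(path, fail_midway);
    return true;
}
```

The point of the design is that no individual error site has to remember the descriptor; any error raised after the open is covered by the single cleanup routine.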

This is a fairly general issue, actually. Looking around, I can see at
least two similar cases in existing code, with BasicOpenFile, where we
will leak file descriptors on error:

copy_file: there are several error cases, including out-of-disk space,
with no attempt to close the fds.

XLogFileInit: again, no attempt to close the file descriptor on failure.
This is called at checkpoint from the checkpointer process, to
preallocate new xlog files.

Given that we haven't heard any complaints of anyone running into these,
these are not a big deal in practice, but in theory at least the
XLogFileInit leak could lead to serious problems, as it could cause the
checkpointer to run out of file descriptors.

- Heikki

#44Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#42)
1 attachment(s)
Re: Switching timeline over streaming replication

On 15.11.2012 12:44, Heikki Linnakangas wrote:

Here's an updated version of this patch, rebased with master, including
the recent replication timeout changes, and some other cleanup.

On 12.10.2012 09:34, Amit Kapila wrote:

The test is finished from myside.

one more issue:
...
./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

Fixed this.

However, the test scenario you point to here:
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com
still seems to be broken, although I get a different error message now.
I'll dig into this..

Ok, here's an updated patch again, with that bug fixed.

- Heikki

Attachments:

streaming-tli-switch-6.patch.gz (application/x-gzip)
#45Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#44)
Re: Switching timeline over streaming replication

On Thursday, November 15, 2012 6:05 PM Heikki Linnakangas wrote:

On 15.11.2012 12:44, Heikki Linnakangas wrote:

Here's an updated version of this patch, rebased with master,
including the recent replication timeout changes, and some other cleanup.

On 12.10.2012 09:34, Amit Kapila wrote:

The test is finished from myside.

one more issue:
...
./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

Fixed this.

However, the test scenario you point to here:
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com
still seems to be broken, although I get a different error message now.
I'll dig into this..

Ok, here's an updated patch again, with that bug fixed.

I shall review and test the updated Patch in Commit Fest.

With Regards,
Amit Kapila.

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#43)
Re: Switching timeline over streaming replication

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

This is a fairly general issue, actually. Looking around, I can see at
least two similar cases in existing code, with BasicOpenFile, where we
will leak file descriptors on error:

Um, don't we automatically clean those up during transaction abort?
If we don't, we ought to think about that, not about cluttering calling
code with certain-to-be-inadequate cleanup in error cases.

regards, tom lane

#47Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#46)
Re: Switching timeline over streaming replication

On 15.11.2012 16:55, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

This is a fairly general issue, actually. Looking around, I can see at
least two similar cases in existing code, with BasicOpenFile, where we
will leak file descriptors on error:

Um, don't we automatically clean those up during transaction abort?

Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files
allocated with AllocateFile() and OpenTemporaryFile() are cleaned up at
abort.

If we don't, we ought to think about that, not about cluttering calling
code with certain-to-be-inadequate cleanup in error cases.

Agreed. Cleaning up at end-of-xact won't help walsender or other
non-backend processes, though, because they don't do transactions. But a
top-level ResourceOwner that's reset in the sigsetjmp() cleanup routine
would work.
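
A rough model of that idea, for illustration only: a process-wide owner records every descriptor opened through it, and the sigsetjmp() recovery branch releases them all. FdOwner, owned_open() and the other names are hypothetical; PostgreSQL's real ResourceOwner machinery is considerably more general than this.

```c
#define _POSIX_C_SOURCE 200809L     /* for sigsetjmp/siglongjmp */

#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_OWNED_FDS 16

/* Minimal owner: remembers the descriptors opened on its behalf. */
typedef struct
{
    int fds[MAX_OWNED_FDS];
    int nfds;
} FdOwner;

static FdOwner top_owner;       /* top-level owner for this process */
static sigjmp_buf top_level;    /* top-level error recovery point */

/* Open a file read-only and register it with the top-level owner. */
int
owned_open(const char *path)
{
    int fd;

    assert(path != NULL);

    if (top_owner.nfds >= MAX_OWNED_FDS)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd >= 0)
        top_owner.fds[top_owner.nfds++] = fd;
    return fd;
}

/* Close everything the owner still holds. */
void
owner_release_all(void)
{
    while (top_owner.nfds > 0)
        close(top_owner.fds[--top_owner.nfds]);
}

/* Number of descriptors currently registered with the owner. */
int
owned_count(void)
{
    return top_owner.nfds;
}

/* Run a unit of work; on error the recovery branch resets the owner. */
bool
run_with_owner(void (*work) (void))
{
    if (sigsetjmp(top_level, 1) != 0)
    {
        owner_release_all();
        return false;
    }
    work();
    owner_release_all();
    return true;
}

/* Example work that opens two files and then fails. */
void
failing_work(void)
{
    owned_open("/dev/null");
    owned_open("/dev/null");
    siglongjmp(top_level, 1);
}
```

Unlike a single static variable, this covers any number of open files without the calling code naming each one, which is what makes it usable by processes that never run transactions.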

- Heikki

#48Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#44)
Re: Switching timeline over streaming replication

On Thursday, November 15, 2012 6:05 PM Heikki Linnakangas wrote:

On 15.11.2012 12:44, Heikki Linnakangas wrote:

Here's an updated version of this patch, rebased with master,
including the recent replication timeout changes, and some other cleanup.

On 12.10.2012 09:34, Amit Kapila wrote:

The test is finished from myside.

one more issue:
...
./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
pg_basebackup: COPY stream ended before last file was finished

Fixed this.

However, the test scenario you point to here:
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com
still seems to be broken, although I get a different error message now.
I'll dig into this..

Ok, here's an updated patch again, with that bug fixed.

First, I started with testing of this patch.

Basic stuff:
------------
- Patch applies OK
- Compiles cleanly with no warnings
- Regression tests pass except the "standbycheck".

At a glance, the "standbycheck" regression failures appear to be because the
SQL scripts and expected outputs are somewhat out of date.

The following problems were observed while testing the patch.
Defect-1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.

LOG: database system was shut down in recovery at 2012-11-16
16:26:29 IST
FATAL: requested timeline 2 does not contain minimum recovery point
0/30143A0 on timeline 1
LOG: startup process (PID 415) exited with exit code 1
LOG: aborting startup due to startup process failure

The above defect is already discussed in the following link.
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com

Defect-2:

1. start primary A
2. start standby B following A
3. start cascade standby C following B with the 'recovery_target_timeline'
option in recovery.conf disabled.
4. Promote standby B.
5. Cascade standby C is not able to follow the new master B because of the
timeline difference.
6. Try to stop the cascade standby C. This fails and the server does not
stop; the WAL receiver process is still running and clients are not
allowed to connect.

Defect-2 happened only once in my test environment; I will try to
reproduce it.

With Regards,
Amit Kapila.

#49Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#48)
1 attachment(s)
Re: Switching timeline over streaming replication

On 10.10.2012 17:54, Thom Brown wrote:

Hmm... I get something different. When I promote standby B, standby
C's log shows:

LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/4000000 on tli 1
LOG: fetching timeline history file for timeline 2 from primary server
LOG: walreceiver ended streaming and awaits new instructions
LOG: new target timeline is 2

Then when I stop then start standby C I get:

FATAL: timeline history was not contiguous
LOG: startup process (PID 22986) exited with exit code 1
LOG: aborting startup due to startup process failure

Found & fixed this one. A paren was misplaced in tliOfPointInHistory()
function..

On 16.11.2012 16:01, Amit Kapila wrote:

The following problems are observed while testing of the patch.
Defect-1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.
6. Restart C, startup is failing with the following error.

LOG: database system was shut down in recovery at 2012-11-16
16:26:29 IST
FATAL: requested timeline 2 does not contain minimum recovery point
0/30143A0 on timeline 1
LOG: startup process (PID 415) exited with exit code 1
LOG: aborting startup due to startup process failure

The above defect is already discussed in the following link.
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com

Fixed now, sorry for neglecting this earlier. The problem was that if
the primary switched to a new timeline at position X, and the standby
followed that switch, on restart it would set minRecoveryPoint to X, and
the new

Defect-2:

1. start primary A
2. start standby B following A
3. start cascade standby C following B with 'recovery_target_timeline'
option in
recovery.conf is disabled.
4. Promote standby B.
5. Cascade Standby C is not able to follow the new master B because of
timeline difference.
6. Try to stop the cascade standby C (which is failing and the
server is not stopping,
observations are as WAL Receiver process is still running and
clients are not allowing to connect).

The defect-2 is happened only once in my test environment, I will try to
reproduce it.

Found it. When restarting the streaming, I reused the WALRCV_STARTING
state. But if you then exited recovery, WalRcvRunning() would think that
the walreceiver is stuck starting up, because it's been longer than 10
seconds since it was launched and it's still in WALRCV_STARTING state,
so it put it into WALRCV_STOPPED state. And walreceiver didn't expect to
be put into STOPPED state after having started up successfully already.

I added a new explicit WALRCV_RESTARTING state to handle that.
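
The failure mode and the fix can be modelled in a few lines. This is a simplified sketch: the real walreceiver state lives in shared memory and has more states than shown, and walrcv_running() here is a hypothetical stand-in for WalRcvRunning(), with the 10-second limit taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,    /* launched, has not begun streaming yet */
    WALRCV_STREAMING,
    WALRCV_RESTARTING   /* new: restarting streaming after a successful start */
} WalRcvState;

#define WALRCV_STARTUP_TIMEOUT 10   /* seconds */

/*
 * Simplified liveness check: a receiver that has sat in STARTING for longer
 * than the timeout is presumed dead and is marked STOPPED.  RESTARTING is
 * deliberately not subject to the timeout, because the process has already
 * started up successfully once.
 */
bool
walrcv_running(WalRcvState *state, time_t start_time, time_t now)
{
    assert(state != NULL);

    if (*state == WALRCV_STARTING &&
        now - start_time > WALRCV_STARTUP_TIMEOUT)
        *state = WALRCV_STOPPED;

    return *state != WALRCV_STOPPED;
}
```

Reusing WALRCV_STARTING for a restart makes a healthy, long-running receiver indistinguishable from one that never came up; a separate state removes that ambiguity.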

In addition to the above bug fixes, there are some small changes since the
last patch version:

* I changed the LOG messages printed in various stages a bit, hopefully
making it easier to follow what's happening. Feedback is welcome on when
and how we should log, and whether some error messages need clarification.

* 'ps' display is updated when the walreceiver enters and exits idle mode

* Updated pg_controldata and pg_resetxlog to handle the new
minRecoveryPointTLI field I added to the control file.

* startup process wakes up walsenders at the end of recovery, so that
cascading standbys are notified immediately when the timeline changes.
That removes some of the delay in the process.

- Heikki

Attachments:

streaming-tli-switch-7.patch.gz (application/x-gzip)
#50Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#49)
Re: Switching timeline over streaming replication

On Monday, November 19, 2012 10:54 PM Heikki Linnakangas wrote:

On 10.10.2012 17:54, Thom Brown wrote:

Hmm... I get something different. When I promote standby B, standby
C's log shows:

The following problems are observed while testing of the patch.
Defect-1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After successful time line switch in cascade standby C, stop C.

6. Restart C, startup is failing with the following error.

LOG: database system was shut down in recovery at 2012-11-16
16:26:29 IST
FATAL: requested timeline 2 does not contain minimum
recovery point 0/30143A0 on timeline 1
LOG: startup process (PID 415) exited with exit code 1
LOG: aborting startup due to startup process failure

The above defect is already discussed in the following link.
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila@huawei.com

Fixed now, sorry for neglecting this earlier. The problem was that if
the primary switched to a new timeline at position X, and the standby
followed that switch, on restart it would set minRecoveryPoint to X, and
the new

I'm not sure the above is fixed, as I don't see any code change for it, and
it fails again in testing.

Below is result of further testing:

Some strange logs are observed during testing.

Note: Stop the node means doing a smart shutdown.

Scenario-1:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
5. Promote standby B.
6. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are present on standby C.
Please check whether these are proper or not.

LOG: restarted WAL streaming at position 0/B000000 on tli 2
LOG: record with zero length at 0/B024C68
LOG: record with zero length at 0/B035528
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
00000002000000000000000B, offset 0

The following two defects were found while testing the new patch.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.
7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Defect-2:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Start another standby D following C.
5. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
6. Promote standby B.
7. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are observed on standby C:

LOG: restarted WAL streaming at position 0/7000000 on tli 2
ERROR: requested WAL segment 000000020000000000000007 has already been
removed
LOG: record with zero length at 0/7028190
LOG: record with zero length at 0/7048540
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000007, offset 0

The following logs are observed on standby D:

LOG: restarted WAL streaming at position 0/7000000 on tli 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL
segment 000000020000000000000007 has already been removed

LOG: streaming replication successfully connected to primary

8. Stop standby D normally and restart D. Restart is failing.

Code Review
------------------
1.

Agreed. Cleaning up at end-of-xact won't help walsender or other
non-backend processes, though, because they don't do
transactions. But a top-level ResourceOwner that's reset in the
sigsetjmp() cleanup routine would work.

Do you think the cleanup of files should be done as part of this patch, or
should it be handled separately, as the problem already exists in other
code paths? In that case, maybe a TODO item can be added.

2. Also for forbidden_in_wal_sender(firstchar);, instead of handling it as
part of each message,
isn't it better if we call only once, something like
is_command_allowed(firstchar);

switch (firstchar)

3. For function
WalRcvStreaming(void)
{
..
if (state == WALRCV_STREAMING || state == WALRCV_STARTING)
}
I think the above check should also cover the new state,
WALRCV_RESTARTING.

4. In function WalReceiverMain(),
can we have a log message similar to the one below, such as "started WAL
streaming at position", even when streaming starts for the first time?
if (!first_stream)
    ereport(LOG,
            (errmsg("restarted WAL streaming at position %X/%X on tli %u",
                    (uint32) (startpoint >> 32), (uint32) startpoint,
                    startpointTLI)));

5. In function WalReceiverMain(),
if (walrcv_startstreaming(startpointTLI, startpoint))
{
..
}
else
    ereport(LOG,
            (errmsg("primary server contains no more WAL on requested timeline %u",
                    startpointTLI)));
I think walrcv_startstreaming() can also return false when it cannot start
streaming because WAL is not available on the requested timeline. So
"no more" in the log message may be misleading in some cases. How about
something similar to
"primary server doesn't have WAL for requested timeline"

6. *** a/src/bin/pg_controldata/pg_controldata.c 
--- b/src/bin/pg_controldata/pg_controldata.c 
*************** 
*** 237,242 **** main(int argc, char *argv[]) 
--- 237,244 ---- 
          printf(_("Minimum recovery ending location:     %X/%X\n"), 
                     (uint32) (ControlFile.minRecoveryPoint >> 32), 
                     (uint32) ControlFile.minRecoveryPoint); 
+         printf(_("Recovery ending timeline:             %u\n"), 
+                    ControlFile.minRecoveryPointTLI); 

Shouldn't the message "Recovery ending timeline" be "Minimum recovery ending
timeline"?

One Doubt
---------------
When pg_receivexlog is receiving an xlog file from a standby server and the
standby gets promoted to master, the connection between pg_receivexlog and
the standby is broken. Because of this, the current xlog file is not renamed
to its final name.

Does such a scenario need any handling?

With Regards,
Amit Kapila.

#51Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#50)
1 attachment(s)
Re: Switching timeline over streaming replication

On 20.11.2012 15:33, Amit Kapila wrote:

Defect-2:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Start another standby D following C.
5. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
6. Promote standby B.
7. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are observed on standby C:

LOG: restarted WAL streaming at position 0/7000000 on tli 2
ERROR: requested WAL segment 000000020000000000000007 has already been
removed
LOG: record with zero length at 0/7028190
LOG: record with zero length at 0/7048540
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000007, offset 0

Hmm, this one is actually a pre-existing bug. There's a sanity check
that the sequence of timeline IDs that are seen in the XLOG page headers
doesn't go backwards. In other words, if the last XLOG page that was
read had timeline id X, the next page must have a tli >= X. The startup
process keeps track of the last seen timeline id in lastPageTLI. In
standby_mode, when the startup process is reading from a pre-existing
file in pg_xlog (typically put there by streaming replication) and it
reaches the end of valid WAL (marked by an error in decoding it, ie.
"record with zero length" in your case), it sleeps for five seconds and
retries. At retry, the WAL file is re-opened, and as part of sanity
checking it, the first page header in the file is validated.

Now, if there was a timeline change in the current WAL segment, and
we've already replayed past that point, lastPageTLI will already be set
to the new TLI, but the first page on the file contains the old TLI.
When the file is re-opened, and the first page is validated, you get the
error.

The fix is quite straightforward: we should refrain from checking the
TLI when we re-open a WAL file. Or better yet, compare it against the
TLI we saw at the beginning of the last WAL segment, not the last WAL page.
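
To make the failure and the proposed fix concrete, here is a simplified, standalone model of the TLI check. It is a sketch and not the patch itself: check_page_tli() and its parameters are made-up names, and only the two tracking variables mirror lastPageTLI and lastSegmentTLI.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

static TimeLineID lastPageTLI = 0;      /* TLI of the last page validated */
static TimeLineID lastSegmentTLI = 0;   /* TLI of the last segment's first page */

/*
 * Model of the timeline sanity check. With segmentonly = true the page is
 * compared against the TLI seen at the start of the last segment, which is
 * the weaker test used when the first page of a segment is re-validated.
 */
static bool
check_page_tli(TimeLineID page_tli, uint32_t offset_in_segment, bool segmentonly)
{
    TimeLineID  min_tli = segmentonly ? lastSegmentTLI : lastPageTLI;

    if (page_tli < min_tli)
        return false;           /* "out-of-sequence timeline ID" */

    lastPageTLI = page_tli;
    if (offset_in_segment == 0)
        lastSegmentTLI = page_tli;
    return true;
}
```

In this model, reading the first page of a segment with TLI 1 and then a later page with TLI 2 leaves lastPageTLI at 2. Re-validating the first page with segmentonly = false then fails, which corresponds to the reported error, while segmentonly = true accepts it because lastSegmentTLI is still 1.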

I propose the attached patch (against 9.2) to fix that. This should be
backpatched to 9.0, where standby_mode was introduced. The code was the
same in 8.4, too, but AFAICS there was no problem there because 8.4
never tried to re-open the same WAL segment after replaying some of it.

- Heikki

Attachments:

fix-segment-reread-after-tli-switch-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8614907..045d21d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -572,6 +572,7 @@ static uint32 readRecordBufSize = 0;
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
 static TimeLineID lastPageTLI = 0;
+static TimeLineID lastSegmentTLI = 0;
 
 static XLogRecPtr minRecoveryPoint;		/* local copy of
 										 * ControlFile->minRecoveryPoint */
@@ -655,7 +656,7 @@ static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly);
 static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
@@ -3927,7 +3928,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 * to go backwards (but we can't reset that variable right here, since
 		 * we might not change files at all).
 		 */
-		lastPageTLI = 0;		/* see comment in ValidXLOGHeader */
+		lastPageTLI = lastSegmentTLI = 0;	/* see comment in ValidXLOGHeader */
 		randAccess = true;		/* allow curFileTLI to go backwards too */
 	}
 
@@ -4190,7 +4191,7 @@ next_record_is_invalid:
  * ReadRecord.	It's not intended for use from anywhere else.
  */
 static bool
-ValidXLOGHeader(XLogPageHeader hdr, int emode)
+ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly)
 {
 	XLogRecPtr	recaddr;
 
@@ -4285,18 +4286,31 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 * successive pages of a consistent WAL sequence.
 	 *
 	 * Of course this check should only be applied when advancing sequentially
-	 * across pages; therefore ReadRecord resets lastPageTLI to zero when
-	 * going to a random page.
+	 * across pages; therefore ReadRecord resets lastPageTLI and
+	 * lastSegmentTLI to zero when going to a random page.
+	 *
+	 * Sometimes we re-open a segment that's already been partially replayed.
+	 * In that case we cannot perform the normal TLI check: if there is a
+	 * timeline switch within the segment, the first page has a smaller TLI
+	 * than later pages following the timeline switch, and we might've read
+	 * them already. As a weaker test, we still check that it's not smaller
+	 * than the TLI we last saw at the beginning of a segment. Pass
+	 * segmentonly = true when re-validating the first page like that, and the
+	 * page you're actually interested in comes later.
 	 */
-	if (hdr->xlp_tli < lastPageTLI)
+	if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("out-of-sequence timeline ID %u (after %u) in log file %u, segment %u, offset %u",
-						hdr->xlp_tli, lastPageTLI,
+						hdr->xlp_tli,
+						segmentonly ? lastSegmentTLI : lastPageTLI,
 						readId, readSeg, readOff)));
 		return false;
 	}
 	lastPageTLI = hdr->xlp_tli;
+	if (readOff == 0)
+		lastSegmentTLI = hdr->xlp_tli;
+
 	return true;
 }
 
@@ -10440,7 +10454,7 @@ retry:
 							readId, readSeg, readOff)));
 			goto next_record_is_invalid;
 		}
-		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, true))
 			goto next_record_is_invalid;
 	}
 
@@ -10462,7 +10476,7 @@ retry:
 				readId, readSeg, readOff)));
 		goto next_record_is_invalid;
 	}
-	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
 		goto next_record_is_invalid;
 
 	Assert(targetId == readId);
#52Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#51)
Re: Switching timeline over streaming replication

On Wednesday, November 21, 2012 11:36 PM Heikki Linnakangas wrote:

On 20.11.2012 15:33, Amit Kapila wrote:

Defect-2:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Start another standby D following C.
5. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
6. Promote standby B.
7. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));

The following logs are observed on standby C:

LOG: restarted WAL streaming at position 0/7000000 on tli 2
ERROR: requested WAL segment 000000020000000000000007 has
already been removed
LOG: record with zero length at 0/7028190
LOG: record with zero length at 0/7048540
LOG: out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000007, offset 0

I propose the attached patch (against 9.2) to fix that. This should be
backpatched to 9.0, where standby_mode was introduced. The code was the
same in 8.4, too, but AFAICS there was no problem there because 8.4
never tried to re-open the same WAL segment after replaying some of it.

Fixed.

With Regards,
Amit Kapila.

#53Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#47)
1 attachment(s)
Plugging fd leaks (was Re: Switching timeline over streaming replication)

On 15.11.2012 17:16, Heikki Linnakangas wrote:

On 15.11.2012 16:55, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

This is a fairly general issue, actually. Looking around, I can see at
least two similar cases in existing code, with BasicOpenFile, where we
will leak file descriptors on error:

Um, don't we automatically clean those up during transaction abort?

Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files
allocated with AllocateFile() and OpenTemporaryFile() are cleaned up at
abort.

If we don't, we ought to think about that, not about cluttering calling
code with certain-to-be-inadequate cleanup in error cases.

Agreed. Cleaning up at end-of-xact won't help walsender or other
non-backend processes, though, because they don't do transactions. But a
top-level ResourceOwner that's reset in the sigsetjmp() cleanup routine
would work.

This is what I came up with. It adds a new function, OpenFile, that
returns a raw file descriptor like BasicOpenFile, but the file
descriptor is associated with the current subtransaction and
automatically closed at end-of-xact, in the same way that AllocateFile
and AllocateDir work. In other words, OpenFile is to open() what
AllocateFile is to fopen(). BasicOpenFile is unchanged, it returns a raw
fd and it's solely the caller's responsibility to close it, but many of
the places that used to call BasicOpenFile now use the safer OpenFile
function instead.
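
The general idea, registering each raw fd so that an error-cleanup path can close whatever is still open, can be sketched outside PostgreSQL like this. It is a toy model under stated assumptions: tracked_open(), tracked_close() and tracked_close_all() are invented names, and a flat array stands in for the per-subtransaction bookkeeping the patch does in fd.c.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TRACKED_FDS 32

static int  tracked_fds[MAX_TRACKED_FDS];
static int  num_tracked = 0;

/* open(2) wrapper that remembers the descriptor for later cleanup */
static int
tracked_open(const char *path, int flags, mode_t mode)
{
    int         fd;

    if (num_tracked >= MAX_TRACKED_FDS)
        return -1;              /* refuse rather than overflow the array */

    fd = open(path, flags, mode);
    if (fd >= 0)
        tracked_fds[num_tracked++] = fd;
    return fd;
}

/* Explicit close: forget the descriptor, then return close(2)'s result */
static int
tracked_close(int fd)
{
    for (int i = 0; i < num_tracked; i++)
    {
        if (tracked_fds[i] == fd)
        {
            tracked_fds[i] = tracked_fds[--num_tracked];
            return close(fd);
        }
    }
    return -1;                  /* fd was not opened with tracked_open */
}

/* Error-cleanup path: close everything that is still registered */
static void
tracked_close_all(void)
{
    while (num_tracked > 0)
        close(tracked_fds[--num_tracked]);
}
```

Callers that close their files normally use tracked_close(); a caller that bails out on error can skip the close, because tracked_close_all() run from the cleanup routine releases the descriptor.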

This patch plugs three existing fd (or virtual fd) leaks:

1. copy_file() - fixed by using OpenFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases.
Can't use OpenFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenFile instead of
PathNameOpenFile.

In addition, this replaces many BasicOpenFile() calls with OpenFile()
that were not leaking, because the code meticulously closed the file on
error. That wasn't strictly necessary, but IMHO it's good for robustness.

One thing I'm not too fond of is the naming. I'm calling the new
functions OpenFile and CloseFile. There's some danger of confusion
there, as the function to close a virtual file opened with
PathNameOpenFile is called FileClose. OpenFile is really the same kind
of operation as AllocateFile and AllocateDir, but returns an unbuffered
fd. So it would be nice if it was called AllocateSomething, too. But
AllocateFile is already taken. And I don't much like the Allocate*
naming for these anyway, you really would expect the name to contain "open".

Do we want to backpatch this? We've had zero complaints, but this seems
fairly safe to backpatch, and at least the leak in copy_file() can be
quite annoying. If you run out of disk space in CREATE DATABASE, the
target file is kept open even though it's deleted, so the space isn't
reclaimed until you disconnect.
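
The effect described above is ordinary POSIX behaviour and can be reproduced with a few lines of C. The sketch below is not PostgreSQL code; bytes_held_after_unlink() is a made-up helper that writes to a file, unlinks it while the descriptor is still open, and reports the size the kernel still holds for it.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a file, write len bytes, unlink it, and return the size still
 * reported through the open descriptor. Returns -1 on any failure.
 */
static long
bytes_held_after_unlink(const char *path, const char *data, size_t len)
{
    struct stat st;
    long        held = -1;
    int         fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;

    if (write(fd, data, len) == (ssize_t) len &&
        unlink(path) == 0 &&
        fstat(fd, &st) == 0)
        held = (long) st.st_size;   /* data survives the unlink */
    else
        unlink(path);               /* best-effort cleanup on failure */

    close(fd);                      /* only now can the space be reclaimed */
    return held;
}
```

As long as the descriptor stays open, the name is gone but the blocks are not freed, which is why a leaked fd in copy_file() keeps the disk space pinned until the backend exits.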

- Heikki

Attachments:

fd-automatic-cleanup-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index dd69c23..cd60dd8 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -531,7 +531,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
 		int			i;
 
 		for (i = 0; i < fdata->num_files; i++)
-			close(fdata->fd[i]);
+			CloseFile(fdata->fd[i]);
 	}
 
 	/* Re-acquire control lock and update page state */
@@ -593,7 +593,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	 * SlruPhysicalWritePage).	Hence, if we are InRecovery, allow the case
 	 * where the file doesn't exist, and return zeroes instead.
 	 */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
+	fd = OpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (fd < 0)
 	{
 		if (errno != ENOENT || !InRecovery)
@@ -614,7 +614,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	{
 		slru_errcause = SLRU_SEEK_FAILED;
 		slru_errno = errno;
-		close(fd);
+		CloseFile(fd);
 		return false;
 	}
 
@@ -623,11 +623,11 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	{
 		slru_errcause = SLRU_READ_FAILED;
 		slru_errno = errno;
-		close(fd);
+		CloseFile(fd);
 		return false;
 	}
 
-	if (close(fd))
+	if (CloseFile(fd))
 	{
 		slru_errcause = SLRU_CLOSE_FAILED;
 		slru_errno = errno;
@@ -740,7 +740,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		 * don't use O_EXCL or O_TRUNC or anything like that.
 		 */
 		SlruFileName(ctl, path, segno);
-		fd = BasicOpenFile(path, O_RDWR | O_CREAT | PG_BINARY,
+		fd = OpenFile(path, O_RDWR | O_CREAT | PG_BINARY,
 						   S_IRUSR | S_IWUSR);
 		if (fd < 0)
 		{
@@ -773,7 +773,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		slru_errcause = SLRU_SEEK_FAILED;
 		slru_errno = errno;
 		if (!fdata)
-			close(fd);
+			CloseFile(fd);
 		return false;
 	}
 
@@ -786,7 +786,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		slru_errcause = SLRU_WRITE_FAILED;
 		slru_errno = errno;
 		if (!fdata)
-			close(fd);
+			CloseFile(fd);
 		return false;
 	}
 
@@ -800,11 +800,11 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		{
 			slru_errcause = SLRU_FSYNC_FAILED;
 			slru_errno = errno;
-			close(fd);
+			CloseFile(fd);
 			return false;
 		}
 
-		if (close(fd))
+		if (CloseFile(fd))
 		{
 			slru_errcause = SLRU_CLOSE_FAILED;
 			slru_errno = errno;
@@ -1078,7 +1078,7 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 			ok = false;
 		}
 
-		if (close(fdata.fd[i]))
+		if (CloseFile(fdata.fd[i]))
 		{
 			slru_errcause = SLRU_CLOSE_FAILED;
 			slru_errno = errno;
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 6006d3d..a62f7ba 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -244,7 +244,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL,
+	fd = OpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL,
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
 		ereport(ERROR,
@@ -262,7 +262,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	else
 		TLHistoryFilePath(path, parentTLI);
 
-	srcfd = BasicOpenFile(path, O_RDONLY, 0);
+	srcfd = OpenFile(path, O_RDONLY, 0);
 	if (srcfd < 0)
 	{
 		if (errno != ENOENT)
@@ -304,7 +304,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 					 errmsg("could not write to file \"%s\": %m", tmppath)));
 			}
 		}
-		close(srcfd);
+		CloseFile(srcfd);
 	}
 
 	/*
@@ -345,7 +345,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 
-	if (close(fd))
+	if (CloseFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 29a2ee6..ba8b55f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -970,15 +970,10 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Create the 2PC state file.
-	 *
-	 * Note: because we use BasicOpenFile(), we are responsible for ensuring
-	 * the FD gets closed in any error exit path.  Once we get into the
-	 * critical section, though, it doesn't matter since any failure causes
-	 * PANIC anyway.
 	 */
 	TwoPhaseFilePath(path, xid);
 
-	fd = BasicOpenFile(path,
+	fd = OpenFile(path,
 					   O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
@@ -995,7 +990,7 @@ EndPrepare(GlobalTransaction gxact)
 		COMP_CRC32(statefile_crc, record->data, record->len);
 		if ((write(fd, record->data, record->len)) != record->len)
 		{
-			close(fd);
+			CloseFile(fd);
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not write two-phase state file: %m")));
@@ -1012,7 +1007,7 @@ EndPrepare(GlobalTransaction gxact)
 
 	if ((write(fd, &bogus_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not write two-phase state file: %m")));
@@ -1021,7 +1016,7 @@ EndPrepare(GlobalTransaction gxact)
 	/* Back up to prepare for rewriting the CRC */
 	if (lseek(fd, -((off_t) sizeof(pg_crc32)), SEEK_CUR) < 0)
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not seek in two-phase state file: %m")));
@@ -1061,13 +1056,13 @@ EndPrepare(GlobalTransaction gxact)
 	/* write correct CRC and close file */
 	if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not write two-phase state file: %m")));
 	}
 
-	if (close(fd) != 0)
+	if (CloseFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close two-phase state file: %m")));
@@ -1144,7 +1139,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 
 	TwoPhaseFilePath(path, xid);
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
+	fd = OpenFile(path, O_RDONLY | PG_BINARY, 0);
 	if (fd < 0)
 	{
 		if (give_warnings)
@@ -1163,7 +1158,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	 */
 	if (fstat(fd, &stat))
 	{
-		close(fd);
+		CloseFile(fd);
 		if (give_warnings)
 			ereport(WARNING,
 					(errcode_for_file_access(),
@@ -1177,14 +1172,14 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 						sizeof(pg_crc32)) ||
 		stat.st_size > MaxAllocSize)
 	{
-		close(fd);
+		CloseFile(fd);
 		return NULL;
 	}
 
 	crc_offset = stat.st_size - sizeof(pg_crc32);
 	if (crc_offset != MAXALIGN(crc_offset))
 	{
-		close(fd);
+		CloseFile(fd);
 		return NULL;
 	}
 
@@ -1195,7 +1190,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 
 	if (read(fd, buf, stat.st_size) != stat.st_size)
 	{
-		close(fd);
+		CloseFile(fd);
 		if (give_warnings)
 			ereport(WARNING,
 					(errcode_for_file_access(),
@@ -1205,7 +1200,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 		return NULL;
 	}
 
-	close(fd);
+	CloseFile(fd);
 
 	hdr = (TwoPhaseFileHeader *) buf;
 	if (hdr->magic != TWOPHASE_MAGIC || hdr->total_len != stat.st_size)
@@ -1469,7 +1464,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 
 	TwoPhaseFilePath(path, xid);
 
-	fd = BasicOpenFile(path,
+	fd = OpenFile(path,
 					   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
@@ -1481,14 +1476,14 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	/* Write content and CRC */
 	if (write(fd, content, len) != len)
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not write two-phase state file: %m")));
 	}
 	if (write(fd, &statefile_crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not write two-phase state file: %m")));
@@ -1500,13 +1495,13 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	 */
 	if (pg_fsync(fd) != 0)
 	{
-		close(fd);
+		CloseFile(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync two-phase state file: %m")));
 	}
 
-	if (close(fd) != 0)
+	if (CloseFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close two-phase state file: %m")));
@@ -1577,7 +1572,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 
 		TwoPhaseFilePath(path, xid);
 
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY, 0);
+		fd = OpenFile(path, O_RDWR | PG_BINARY, 0);
 		if (fd < 0)
 		{
 			if (errno == ENOENT)
@@ -1596,14 +1591,14 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 
 		if (pg_fsync(fd) != 0)
 		{
-			close(fd);
+			CloseFile(fd);
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not fsync two-phase state file \"%s\": %m",
 							path)));
 		}
 
-		if (close(fd) != 0)
+		if (CloseFile(fd) != 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not close two-phase state file \"%s\": %m",
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2540c..5fc000e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2246,6 +2246,16 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 
 	unlink(tmppath);
 
+	/*
+	 * Allocate a buffer full of zeros. This is done before opening the file
+	 * so that we don't leak the file descriptor if palloc fails.
+	 *
+	 * Note: palloc zbuffer, instead of just using a local char array, to
+	 * ensure it is reasonably well-aligned; this may save a few cycles
+	 * transferring data to the kernel.
+	 */
+	zbuffer = (char *) palloc0(XLOG_BLCKSZ);
+
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
 	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
 					   S_IRUSR | S_IWUSR);
@@ -2262,12 +2272,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 * fsync below) that all the indirect blocks are down on disk.	Therefore,
 	 * fdatasync(2) or O_DSYNC will be sufficient to sync future writes to the
 	 * log file.
-	 *
-	 * Note: palloc zbuffer, instead of just using a local char array, to
-	 * ensure it is reasonably well-aligned; this may save a few cycles
-	 * transferring data to the kernel.
 	 */
-	zbuffer = (char *) palloc0(XLOG_BLCKSZ);
 	for (nbytes = 0; nbytes < XLogSegSize; nbytes += XLOG_BLCKSZ)
 	{
 		errno = 0;
@@ -2279,6 +2284,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 			 * If we fail to make the file, delete it to release disk space
 			 */
 			unlink(tmppath);
+
+			close(fd);
+
 			/* if write didn't set errno, assume problem is no disk space */
 			errno = save_errno ? save_errno : ENOSPC;
 
@@ -2290,9 +2298,12 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	pfree(zbuffer);
 
 	if (pg_fsync(fd) != 0)
+	{
+		close(fd);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
+	}
 
 	if (close(fd))
 		ereport(ERROR,
@@ -2363,7 +2374,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno);
-	srcfd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
+	srcfd = OpenFile(path, O_RDONLY | PG_BINARY, 0);
 	if (srcfd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -2377,7 +2388,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+	fd = OpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
 		ereport(ERROR,
@@ -2423,12 +2434,12 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 
-	if (close(fd))
+	if (CloseFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
-	close(srcfd);
+	CloseFile(srcfd);
 
 	/*
 	 * Now move the segment into place with its final name.
diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c
index dbc00b4..6be0a07 100644
--- a/src/backend/libpq/be-fsstubs.c
+++ b/src/backend/libpq/be-fsstubs.c
@@ -442,7 +442,7 @@ lo_import_with_oid(PG_FUNCTION_ARGS)
 static Oid
 lo_import_internal(text *filename, Oid lobjOid)
 {
-	File		fd;
+	int			fd;
 	int			nbytes,
 				tmp PG_USED_FOR_ASSERTS_ONLY;
 	char		buf[BUFSIZE];
@@ -464,7 +464,7 @@ lo_import_internal(text *filename, Oid lobjOid)
 	 * open the file to be read in
 	 */
 	text_to_cstring_buffer(filename, fnamebuf, sizeof(fnamebuf));
-	fd = PathNameOpenFile(fnamebuf, O_RDONLY | PG_BINARY, S_IRWXU);
+	fd = OpenFile(fnamebuf, O_RDONLY | PG_BINARY, S_IRWXU);
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -481,7 +481,7 @@ lo_import_internal(text *filename, Oid lobjOid)
 	 */
 	lobj = inv_open(oid, INV_WRITE, fscxt);
 
-	while ((nbytes = FileRead(fd, buf, BUFSIZE)) > 0)
+	while ((nbytes = read(fd, buf, BUFSIZE)) > 0)
 	{
 		tmp = inv_write(lobj, buf, nbytes);
 		Assert(tmp == nbytes);
@@ -494,7 +494,7 @@ lo_import_internal(text *filename, Oid lobjOid)
 						fnamebuf)));
 
 	inv_close(lobj);
-	FileClose(fd);
+	CloseFile(fd);
 
 	return oid;
 }
@@ -508,7 +508,7 @@ lo_export(PG_FUNCTION_ARGS)
 {
 	Oid			lobjId = PG_GETARG_OID(0);
 	text	   *filename = PG_GETARG_TEXT_PP(1);
-	File		fd;
+	int			fd;
 	int			nbytes,
 				tmp;
 	char		buf[BUFSIZE];
@@ -540,7 +540,7 @@ lo_export(PG_FUNCTION_ARGS)
 	 */
 	text_to_cstring_buffer(filename, fnamebuf, sizeof(fnamebuf));
 	oumask = umask(S_IWGRP | S_IWOTH);
-	fd = PathNameOpenFile(fnamebuf, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+	fd = OpenFile(fnamebuf, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
 						  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
 	umask(oumask);
 	if (fd < 0)
@@ -554,7 +554,7 @@ lo_export(PG_FUNCTION_ARGS)
 	 */
 	while ((nbytes = inv_read(lobj, buf, BUFSIZE)) > 0)
 	{
-		tmp = FileWrite(fd, buf, nbytes);
+		tmp = write(fd, buf, nbytes);
 		if (tmp != nbytes)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -562,7 +562,7 @@ lo_export(PG_FUNCTION_ARGS)
 							fnamebuf)));
 	}
 
-	FileClose(fd);
+	CloseFile(fd);
 	inv_close(lobj);
 
 	PG_RETURN_INT32(1);
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index cf47708..44c66db 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -162,13 +162,13 @@ copy_file(char *fromfile, char *tofile)
 	/*
 	 * Open the files
 	 */
-	srcfd = BasicOpenFile(fromfile, O_RDONLY | PG_BINARY, 0);
+	srcfd = OpenFile(fromfile, O_RDONLY | PG_BINARY, 0);
 	if (srcfd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", fromfile)));
 
-	dstfd = BasicOpenFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+	dstfd = OpenFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
 						  S_IRUSR | S_IWUSR);
 	if (dstfd < 0)
 		ereport(ERROR,
@@ -209,12 +209,12 @@ copy_file(char *fromfile, char *tofile)
 		(void) pg_flush_data(dstfd, offset, nbytes);
 	}
 
-	if (close(dstfd))
+	if (CloseFile(dstfd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tofile)));
 
-	close(srcfd);
+	CloseFile(srcfd);
 
 	pfree(buffer);
 }
@@ -238,11 +238,11 @@ fsync_fname(char *fname, bool isdir)
 	 * cases here
 	 */
 	if (!isdir)
-		fd = BasicOpenFile(fname,
+		fd = OpenFile(fname,
 						   O_RDWR | PG_BINARY,
 						   S_IRUSR | S_IWUSR);
 	else
-		fd = BasicOpenFile(fname,
+		fd = OpenFile(fname,
 						   O_RDONLY | PG_BINARY,
 						   S_IRUSR | S_IWUSR);
 
@@ -263,7 +263,7 @@ fsync_fname(char *fname, bool isdir)
 	/* Some OSs don't allow us to fsync directories at all */
 	if (returncode != 0 && isdir && errno == EBADF)
 	{
-		close(fd);
+		CloseFile(fd);
 		return;
 	}
 
@@ -272,5 +272,5 @@ fsync_fname(char *fname, bool isdir)
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", fname)));
 
-	close(fd);
+	CloseFile(fd);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ecb62ba..0bb74d4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -30,11 +30,29 @@
  * routines (e.g., open(2) and fopen(3)) themselves.  Otherwise, we
  * may find ourselves short of real file descriptors anyway.
  *
- * This file used to contain a bunch of stuff to support RAID levels 0
- * (jbod), 1 (duplex) and 5 (xor parity).  That stuff is all gone
- * because the parallel query processing code that called it is all
- * gone.  If you really need it you could get it from the original
- * POSTGRES source.
+ * INTERFACE ROUTINES
+ *
+ * PathNameOpenFile and OpenTemporaryFile are used to open virtual files.
+ * A File opened with OpenTemporaryFile is automatically deleted when the
+ * File is closed, either explicitly or implicitly at end of transaction or
+ * process exit. PathNameOpenFile is intended for files that are held open
+ * for a long time, like relation files. It is the caller's responsibility
+ * to close them, there is no automatic mechanism in fd.c for that.
+ *
+ * AllocateFile, AllocateDir and OpenFile are wrappers around fopen(3),
+ * opendir(3), and open(2), respectively. They behave like the corresponding
+ * native functions, except that the handle is registered with the current
+ * subtransaction, and will be automatically closed at abort. These are
+ * intended for short operations like reading a configuration file. There is
+ * a fixed limit on the number of files that can be open using these functions
+ * at any one time.
+ *
+ * Finally, BasicOpenFile is just a thin wrapper around open() that can
+ * release file descriptors in use by the virtual file descriptors if
+ * necessary. There is no automatic cleanup of file descriptors returned by
+ * BasicOpenFile, it is solely the caller's responsibility to close the file
+ * descriptor by calling close(2).
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -94,7 +112,7 @@ int			max_files_per_process = 1000;
 
 /*
  * Maximum number of file descriptors to open for either VFD entries or
- * AllocateFile/AllocateDir operations.  This is initialized to a conservative
+ * AllocateFile/AllocateDir/OpenFile operations.  This is initialized to a conservative
  * value, and remains that way indefinitely in bootstrap or standalone-backend
  * cases.  In normal postmaster operation, the postmaster calls
  * set_max_safe_fds() late in initialization to update the value, and that
@@ -171,10 +189,9 @@ static bool have_xact_temporary_files = false;
 static uint64 temporary_files_size = 0;
 
 /*
- * List of stdio FILEs and <dirent.h> DIRs opened with AllocateFile
- * and AllocateDir.
+ * List of OS handles opened with AllocateFile, AllocateDir and OpenFile.
  *
- * Since we don't want to encourage heavy use of AllocateFile or AllocateDir,
+ * Since we don't want to encourage heavy use of those functions,
  * it seems OK to put a pretty small maximum limit on the number of
  * simultaneously allocated descs.
  */
@@ -183,7 +200,8 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
-	AllocateDescDir
+	AllocateDescDir,
+	AllocateDescRawFD
 } AllocateDescKind;
 
 typedef struct
@@ -193,6 +211,7 @@ typedef struct
 	{
 		FILE	   *file;
 		DIR		   *dir;
+		int			fd;
 	}			desc;
 	SubTransactionId create_subid;
 } AllocateDesc;
@@ -1458,7 +1477,6 @@ FilePathName(File file)
 	return VfdCache[file].fileName;
 }
 
-
 /*
  * Routines that want to use stdio (ie, FILE*) should use AllocateFile
  * rather than plain fopen().  This lets fd.c deal with freeing FDs if
@@ -1523,8 +1541,45 @@ TryAgain:
 	return NULL;
 }
 
+
+/*
+ * OpenFile --- Like AllocateFile, but returns an unbuffered fd like open(2)
+ */
+int
+OpenFile(FileName fileName, int fileFlags, int fileMode)
+{
+	int			fd;
+
+	/*
+	 * The test against MAX_ALLOCATED_DESCS prevents us from overflowing
+	 * allocatedFiles[]; the test against max_safe_fds prevents BasicOpenFile
+	 * from hogging every one of the available FDs, which'd lead to infinite
+	 * looping.
+	 */
+	if (numAllocatedDescs >= MAX_ALLOCATED_DESCS ||
+		numAllocatedDescs >= max_safe_fds - 1)
+		elog(ERROR, "exceeded MAX_ALLOCATED_DESCS while trying to open file \"%s\"",
+			 fileName);
+
+	fd = BasicOpenFile(fileName, fileFlags, fileMode);
+
+	if (fd >= 0)
+	{
+		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+
+		desc->kind = AllocateDescRawFD;
+		desc->desc.fd = fd;
+		desc->create_subid = GetCurrentSubTransactionId();
+		numAllocatedDescs++;
+
+		return fd;
+	}
+
+	return -1;					/* failure */
+}
+
 /*
- * Free an AllocateDesc of either type.
+ * Free an AllocateDesc of any type.
  *
  * The argument *must* point into the allocatedDescs[] array.
  */
@@ -1542,6 +1597,9 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescDir:
 			result = closedir(desc->desc.dir);
 			break;
+		case AllocateDescRawFD:
+			result = close(desc->desc.fd);
+			break;
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -1583,6 +1641,33 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+/*
+ * Close a file returned by OpenFile.
+ *
+ * Note we do not check close's return value --- it is up to the caller
+ * to handle close errors.
+ */
+int
+CloseFile(int fd)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "CloseFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove fd from list of allocated files, if it's present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescRawFD && desc->desc.fd == fd)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us a file not in allocatedDescs */
+	elog(WARNING, "fd passed to CloseFile was not obtained from OpenFile");
+
+	return close(fd);
+}
 
 /*
  * Routines that want to use <dirent.h> (ie, DIR*) should use AllocateDir
@@ -1874,7 +1959,7 @@ AtProcExit_Files(int code, Datum arg)
  * exiting. If that's the case, we should remove all temporary files; if
  * that's not the case, we are being called for transaction commit/abort
  * and should only remove transaction-local temp files.  In either case,
- * also clean up "allocated" stdio files and dirs.
+ * also clean up "allocated" stdio files, dirs and fds.
  */
 static void
 CleanupTempFiles(bool isProcExit)
@@ -1916,7 +2001,7 @@ CleanupTempFiles(bool isProcExit)
 		have_xact_temporary_files = false;
 	}
 
-	/* Clean up "allocated" stdio files and dirs. */
+	/* Clean up "allocated" stdio files, dirs and fds. */
 	while (numAllocatedDescs > 0)
 		FreeDesc(&allocatedDescs[0]);
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3f4ab49..b54d3a2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -401,14 +401,14 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
 		/* truncate(2) would be easier here, but Windows hasn't got it */
 		int			fd;
 
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY, 0);
+		fd = OpenFile(path, O_RDWR | PG_BINARY, 0);
 		if (fd >= 0)
 		{
 			int			save_errno;
 
 			ret = ftruncate(fd, 0);
 			save_errno = errno;
-			close(fd);
+			CloseFile(fd);
 			errno = save_errno;
 		}
 		else
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6f21495..787b2c7 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -588,7 +588,7 @@ load_relmap_file(bool shared)
 	}
 
 	/* Read data ... */
-	fd = BasicOpenFile(mapfilename, O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
+	fd = OpenFile(mapfilename, O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (fd < 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
@@ -608,7 +608,7 @@ load_relmap_file(bool shared)
 				 errmsg("could not read relation mapping file \"%s\": %m",
 						mapfilename)));
 
-	close(fd);
+	CloseFile(fd);
 
 	/* check for correct magic number, etc */
 	if (map->magic != RELMAPPER_FILEMAGIC ||
@@ -672,12 +672,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 	/*
 	 * Open the target file.  We prefer to do this before entering the
 	 * critical section, so that an open() failure need not force PANIC.
-	 *
-	 * Note: since we use BasicOpenFile, we are nominally responsible for
-	 * ensuring the fd is closed on error.	In practice, this isn't important
-	 * because either an error happens inside the critical section, or we are
-	 * in bootstrap or WAL replay; so an error past this point is always fatal
-	 * anyway.
 	 */
 	if (shared)
 	{
@@ -692,7 +686,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 		realmap = &local_map;
 	}
 
-	fd = BasicOpenFile(mapfilename,
+	fd = OpenFile(mapfilename,
 					   O_WRONLY | O_CREAT | PG_BINARY,
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
@@ -753,7 +747,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
 				 errmsg("could not fsync relation mapping file \"%s\": %m",
 						mapfilename)));
 
-	if (close(fd))
+	if (CloseFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close relation mapping file \"%s\": %m",
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 849bb10..61530b6 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -16,13 +16,13 @@
  * calls:
  *
  *	File {Close, Read, Write, Seek, Tell, Sync}
- *	{File Name Open, Allocate, Free} File
+ *	{Path Name Open, Allocate, Free} File
  *
  * These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
  * Use them for all file activity...
  *
  *	File fd;
- *	fd = FilePathOpenFile("foo", O_RDONLY, 0600);
+ *	fd = PathNameOpenFile("foo", O_RDONLY, 0600);
  *
  *	AllocateFile();
  *	FreeFile();
@@ -33,7 +33,8 @@
  * no way for them to share kernel file descriptors with other files.
  *
  * Likewise, use AllocateDir/FreeDir, not opendir/closedir, to allocate
- * open directories (DIR*).
+ * open directories (DIR*). And OpenFile/CloseFile for an unbuffered
+ * file descriptor.
  */
 #ifndef FD_H
 #define FD_H
@@ -84,6 +85,10 @@ extern DIR *AllocateDir(const char *dirname);
 extern struct dirent *ReadDir(DIR *dir, const char *dirname);
 extern int	FreeDir(DIR *dir);
 
+/* Operations to allow use of a plain kernel FD, with automatic cleanup */
+extern int	OpenFile(FileName fileName, int fileFlags, int fileMode);
+extern int	CloseFile(int fd);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(FileName fileName, int fileFlags, int fileMode);
 
#54Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#53)
Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)

On Friday, November 23, 2012 7:03 PM Heikki Linnakangas wrote:

On 15.11.2012 17:16, Heikki Linnakangas wrote:

On 15.11.2012 16:55, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

This is a fairly general issue, actually. Looking around, I can see
at least two similar cases in existing code, with BasicOpenFile,
where we will leak file descriptors on error:

Um, don't we automatically clean those up during transaction abort?

Not the ones allocated with PathNameOpenFile or BasicOpenFile. Files
allocated with AllocateFile() and OpenTemporaryFile() are cleaned up
at abort.

If we don't, we ought to think about that, not about cluttering
calling code with certain-to-be-inadequate cleanup in error cases.

Agreed. Cleaning up at end-of-xact won't help walsender or other
non-backend processes, though, because they don't do transactions. But
a top-level ResourceOwner that's reset in the sigsetjmp() cleanup
routine would work.

This is what I came up with. It adds a new function, OpenFile, that
returns a raw file descriptor like BasicOpenFile, but the file
descriptor is associated with the current subtransaction and
automatically closed at end-of-xact, in the same way that AllocateFile
and AllocateDir work. In other words, OpenFile is to open() what
AllocateFile is to fopen(). BasicOpenFile is unchanged, it returns a raw
fd and it's solely the caller's responsibility to close it, but many of
the places that used to call BasicOpenFile now use the safer OpenFile
function instead.
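
To make the registration idea concrete, here is a minimal standalone sketch of the bookkeeping. It is not the fd.c code: tracked_open, tracked_close, tracked_cleanup and MAX_TRACKED_FDS are made-up names, there is no subtransaction tracking, and it assumes the table is compacted on removal, which the "while (numAllocatedDescs > 0) FreeDesc(&allocatedDescs[0])" loop in the patch appears to rely on.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, simplified stand-in for fd.c's allocatedDescs[] table. */
#define MAX_TRACKED_FDS 32

static int	tracked_fds[MAX_TRACKED_FDS];
static int	num_tracked = 0;

/* Open a file and remember the descriptor; returns -1 on failure. */
int
tracked_open(const char *path, int flags, mode_t mode)
{
	int			fd;

	if (num_tracked >= MAX_TRACKED_FDS)
		return -1;				/* table full; the patch raises ERROR here */

	fd = open(path, flags, mode);
	if (fd >= 0)
		tracked_fds[num_tracked++] = fd;
	return fd;
}

/* Close a descriptor and drop it from the table, if it is registered. */
int
tracked_close(int fd)
{
	int			i;

	for (i = num_tracked; --i >= 0;)
	{
		if (tracked_fds[i] == fd)
		{
			/* Compact the table by moving the last entry into the hole. */
			tracked_fds[i] = tracked_fds[--num_tracked];
			return close(fd);
		}
	}
	return close(fd);			/* not registered; close it anyway */
}

/* Roughly what abort-time cleanup does: close whatever is still registered. */
void
tracked_cleanup(void)
{
	while (num_tracked > 0)
		tracked_close(tracked_fds[0]);
	assert(num_tracked == 0);
}
```

A caller that errors out between tracked_open() and tracked_close() leaks nothing, as long as something calls tracked_cleanup() on the error path. In the patch, that role is played by the end-of-transaction cleanup.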

This patch plugs three existing fd (or virtual fd) leaks:

1. copy_file() - fixed by using OpenFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases.
Can't use OpenFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenFile instead of
PathNameOpenFile.

I have gone through the patch and found it okay, except for one minor
suggestion:
1. Can we add the log line below to OpenFile as well?
+ DO_DB(elog(LOG, "CloseFile: Allocated %d", numAllocatedDescs));

One thing I'm not too fond of is the naming. I'm calling the new
functions OpenFile and CloseFile. There's some danger of confusion
there, as the function to close a virtual file opened with
PathNameOpenFile is called FileClose. OpenFile is really the same kind
of operation as AllocateFile and AllocateDir, but returns an unbuffered
fd. So it would be nice if it was called AllocateSomething, too. But
AllocateFile is already taken. And I don't much like the Allocate*
naming for these anyway, you really would expect the name to contain
"open".

OpenFileInTrans
OpenTransactionAwareFile

In any case, OpenFile is also okay.

I have one use case in a feature (SQL command to edit postgresql.conf) that is
very similar to OpenFile/CloseFile, but when CloseFile is called from abort, it
should also remove (unlink) the file, and the open should be retried a few
times if it does not succeed.
I have the following options:
1. Extend OpenFile/CloseFile or PathNameOpenFile
2. Write new functions similar to OpenFile/CloseFile, something like
OpenConfLockFile/CloseConfLockFile
3. Use OpenFile/CloseFile and handle my specific case with PG_TRY ..
PG_CATCH

Any suggestions?

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#54)
Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)

On 26.11.2012 14:53, Amit Kapila wrote:

I have one use case in a feature (SQL command to edit postgresql.conf) that is
very similar to OpenFile/CloseFile, but when CloseFile is called from abort, it
should also remove (unlink) the file, and the open should be retried a few
times if it does not succeed.
I have the following options:
1. Extend OpenFile/CloseFile or PathNameOpenFile
2. Write new functions similar to OpenFile/CloseFile, something like
OpenConfLockFile/CloseConfLockFile
3. Use OpenFile/CloseFile and handle my specific case with PG_TRY ..
PG_CATCH

Any suggestions?

Hmm, if it's just for locking purposes, how about using a lwlock or a
heavy-weight lock instead?

- Heikki

#56Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#55)
Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)

On Monday, November 26, 2012 7:01 PM Heikki Linnakangas wrote:

On 26.11.2012 14:53, Amit Kapila wrote:

I have one use case in a feature (SQL command to edit postgresql.conf) that is
very similar to OpenFile/CloseFile, but when CloseFile is called from abort, it
should also remove (unlink) the file, and the open should be retried a few
times if it does not succeed.
I have the following options:
1. Extend OpenFile/CloseFile or PathNameOpenFile
2. Write new functions similar to OpenFile/CloseFile, something like
OpenConfLockFile/CloseConfLockFile
3. Use OpenFile/CloseFile and handle my specific case with PG_TRY ..
PG_CATCH

Any suggestions?

Hmm, if it's just for locking purposes, how about using a lwlock or a
heavy-weight lock instead?

It's not only for locking. The main idea is that we create a temp file and
write the modified configuration into it.
At the end, if that succeeds, we rename the temp file to the .conf file; if it
errors out, we need to delete the temp file at abort.

So in short, the main point is to close and rename the file on success (at the
end of the command) and to remove it on abort.

With Regards,
Amit Kapila.

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#56)
Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)

Amit Kapila <amit.kapila@huawei.com> writes:

On Monday, November 26, 2012 7:01 PM Heikki Linnakangas wrote:

Hmm, if it's just for locking purposes, how about using a lwlock or a
heavy-weight lock instead?

It's not only for locking. The main idea is that we create a temp file and
write the modified configuration into it.
At the end, if that succeeds, we rename the temp file to the .conf file; if it
errors out, we need to delete the temp file at abort.

So in short, the main point is to close and rename the file on success (at the
end of the command) and to remove it on abort.

I'd go with the TRY/CATCH solution. It would be worth extending the
fd.c infrastructure if there were multiple users of the feature, but
there are not, nor do I see likely new candidates on the horizon.

regards, tom lane
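
For illustration, the write-to-temp-file, rename-on-success, unlink-on-failure pattern discussed above might look roughly like this outside the backend. This is a plain POSIX sketch, not PostgreSQL code: replace_file_atomically is a made-up name, and ordinary error returns stand in for the PG_TRY/PG_CATCH block that would do the unlink inside the server.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'contents' to tmp_path, then rename it over
 * final_path.  On any failure the temp file is removed, which is the job
 * the PG_CATCH branch would have in the backend.
 * Returns 0 on success, -1 on failure.
 */
int
replace_file_atomically(const char *tmp_path, const char *final_path,
						const char *contents)
{
	size_t		len;
	int			fd;

	assert(tmp_path != NULL && final_path != NULL && contents != NULL);
	len = strlen(contents);

	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
	if (fd < 0)
		return -1;

	/* Write and flush; on failure, close and remove the temp file. */
	if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		unlink(tmp_path);
		return -1;
	}

	/* Success path: close, then rename into place. */
	if (close(fd) != 0 || rename(tmp_path, final_path) != 0)
	{
		unlink(tmp_path);
		return -1;
	}
	return 0;
}
```

The retry-on-open behaviour mentioned earlier in the thread is left out. It would wrap the open() call in a bounded loop.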

#58Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#54)
Re: Plugging fd leaks (was Re: Switching timeline over streaming replication)

On 26.11.2012 14:53, Amit Kapila wrote:

On Friday, November 23, 2012 7:03 PM Heikki Linnakangas wrote:

This is what I came up with. It adds a new function, OpenFile, that
returns a raw file descriptor like BasicOpenFile, but the file
descriptor is associated with the current subtransaction and
automatically closed at end-of-xact, in the same way that AllocateFile
and AllocateDir work. In other words, OpenFile is to open() what
AllocateFile is to fopen(). BasicOpenFile is unchanged, it returns a raw
fd and it's solely the caller's responsibility to close it, but many of
the places that used to call BasicOpenFile now use the safer OpenFile
function instead.

This patch plugs three existing fd (or virtual fd) leaks:

1. copy_file() - fixed by using OpenFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases.
Can't use OpenFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenFile instead of
PathNameOpenFile.

I have gone through the patch and found it okay, except for one minor
suggestion:
1. Can we add the log line below to OpenFile as well?
+ DO_DB(elog(LOG, "CloseFile: Allocated %d", numAllocatedDescs));

Thanks. Added that and committed.

I didn't dare to backpatch this, even though it could be fairly easily
backpatched. The leaks exist in older versions too, but since they're
extremely rare (zero complaints from the field and it's been like that
forever), I didn't want to take the risk. Maybe later, after this has
had more testing in master.

One thing I'm not too fond of is the naming. I'm calling the new
functions OpenFile and CloseFile. There's some danger of confusion
there, as the function to close a virtual file opened with
PathNameOpenFile is called FileClose. OpenFile is really the same kind
of operation as AllocateFile and AllocateDir, but returns an unbuffered
fd. So it would be nice if it was called AllocateSomething, too. But
AllocateFile is already taken. And I don't much like the Allocate*
naming for these anyway, you really would expect the name to contain
"open".

OpenFileInTrans
OpenTransactionAwareFile

In any case, OpenFile is also okay.

I ended up calling the functions OpenTransientFile and
CloseTransientFile. Windows has a library function called "OpenFile", so
that was a pretty bad choice after all.

- Heikki

#59senthilnathan
senthilnathan.t@gmail.com
In reply to: Amit Kapila (#20)
Re: Switching timeline over streaming replication

Is this patch available in version 9.2.1 ?

Senthil

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Switching-timeline-over-streaming-replication-tp5723547p5734744.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

#60Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: senthilnathan (#59)
Re: Switching timeline over streaming replication

On 03.12.2012 14:21, senthilnathan wrote:

Is this patch available in version 9.2.1 ?

Nope, this is for 9.3.

- Heikki

#61Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#50)
1 attachment(s)
Re: Switching timeline over streaming replication

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new version
of the main patch based on current HEAD.

- Heikki

Attachments:

streaming-tli-switch-8.patch.gz (application/x-gzip)
#62Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#61)
Re: Switching timeline over streaming replication

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new version
of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.
7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Defect-2:
1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Start another standby D following C.
5. Execute the following commands in the primary A.
create table tbl(f int);
insert into tbl values(generate_series(1,1000));
6. Promote standby B.
7. Execute the following commands in the primary B.
insert into tbl values(generate_series(1001,2000));
insert into tbl values(generate_series(2001,3000));
8. Stop standby D normally and restart D. Restart is failing.
9. Stop standby C normally and restart C. Restart is failing.

Note: Stop the node means doing a smart shutdown.

With Regards,
Amit Kapila.

#63Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#62)
1 attachment(s)
Re: Switching timeline over streaming replication

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new version
of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.
7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup
process failure

It seems that the commits I made to master already:

http://archives.postgresql.org/pgsql-committers/2012-12/msg00116.php
http://archives.postgresql.org/pgsql-committers/2012-12/msg00111.php

were a few bricks shy of a load. The problem is that if recovery stops
at a checkpoint record that changes timeline, so that minRecoveryPoint
points to the end of the checkpoint record, we still record the old TLI
as the TLI of minRecoveryPoint. This is because 1) there's a silly bug
in the patch; replayEndTLI is not updated along with replayEndRecPtr.
But even after fixing that, we're not good.

The problem is that replayEndRecPtr is currently updated *before*
calling the redo function. replayEndRecPtr is what becomes
minRecoveryPoint when XLogFlush() is called. If the record is a
checkpoint record, redoing it will switch recovery to the new timeline,
but replayEndTLI will not be updated until the next record.

IOW, as far as minRecoveryPoint is concerned, a checkpoint record that
switches timeline is considered to be part of the old timeline. But when
a server is promoted and a new timeline is created, the checkpoint
record is considered to be part of the new timeline; that's what we
write in the page header and in the control file.

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.
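
As a rough illustration of the peek described above, here is a standalone toy model of choosing the TLI to record alongside replayEndRecPtr before the redo function runs. It is not PostgreSQL code: ToyRecord and replay_end_tli_for are made-up names, and the record is reduced to the two fields that matter for this decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;	/* toy type, mirrors the name used above */

/* Hypothetical, stripped-down WAL record. */
typedef struct
{
	bool		is_shutdown_checkpoint;
	TimeLineID	checkpoint_tli; /* meaningful only for shutdown checkpoints */
} ToyRecord;

/*
 * Decide which timeline the end of this record belongs to.  A shutdown
 * checkpoint carries its own TLI, so it is treated as already being on
 * that (possibly new) timeline; every other record stays on the
 * timeline currently being replayed.
 */
TimeLineID
replay_end_tli_for(const ToyRecord *rec, TimeLineID current_tli)
{
	assert(rec != NULL);

	if (rec->is_shutdown_checkpoint)
		return rec->checkpoint_tli;
	return current_tli;
}
```

With current_tli = 1 and a shutdown checkpoint carrying TLI 2, the function yields 2, so a minimum recovery point taken at the end of that record would be stamped with the new timeline rather than the old one.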

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

- Heikki

Attachments:

fix-minrecoverypointtli.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 702ea7c..bdae7a4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5822,6 +5822,7 @@ StartupXLOG(void)
 			 */
 			do
 			{
+				TimeLineID EndTLI;
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
 				 (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
@@ -5895,8 +5896,20 @@ StartupXLOG(void)
 				 * Update shared replayEndRecPtr before replaying this record,
 				 * so that XLogFlush will update minRecoveryPoint correctly.
 				 */
+				if (record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN)
+				{
+					CheckPoint	checkPoint;
+
+					memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
+					EndTLI = checkPoint.ThisTimeLineID;
+				}
+				else
+					EndTLI = ThisTimeLineID;
+
 				SpinLockAcquire(&xlogctl->info_lck);
 				xlogctl->replayEndRecPtr = EndRecPtr;
+				xlogctl->replayEndTLI = EndTLI;
 				recoveryPause = xlogctl->recoveryPause;
 				SpinLockRelease(&xlogctl->info_lck);
 
#64Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#63)
Re: Switching timeline over streaming replication

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new
version of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.

7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup
process failure

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

This has fixed the reported problem.
However, I can't tell whether there would be any problem if we removed the
check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and just updated
replayEndTLI with ThisTimeLineID?

With Regards,
Amit Kapila.

#65Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#64)
1 attachment(s)
Re: Switching timeline over streaming replication

On 06.12.2012 15:39, Amit Kapila wrote:

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new
version of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful time line switch in cascade standby C & D, stop D.

7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup
process failure

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

This has fixed the reported problem.
However, I can't tell whether there would be any problem if we removed the
check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and just updated
replayEndTLI with ThisTimeLineID?

Well, it seems wrong for the control file to contain a situation like this:

pg_control version number: 932
Catalog version number: 201211281
Database system identifier: 5819228770976387006
Database cluster state: shut down in recovery
pg_control last modified: pe 7. joulukuuta 2012 17.39.57
Latest checkpoint location: 0/3023EA8
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3023EA8
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
...
Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
Min recovery ending location: 0/3023F08
Min recovery ending loc's timeline: 1

Note the latest checkpoint location and its TimeLineID, and compare them
with the min recovery ending location. The min recovery ending location
is ahead of the latest checkpoint's location; it actually points to the
end of the checkpoint record. But how come the min recovery ending
location's timeline is 1, while the checkpoint record's timeline is 2?

Now maybe that would happen to work if we remove the sanity check, but it
still seems horribly confusing. I'm afraid that discrepancy will come
back to haunt us later if we leave it like that. So I'd like to fix that.
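
The discrepancy in the pg_control dump above can be expressed as a small standalone check. This is only a sketch to make the inconsistency concrete, not a check that exists in PostgreSQL: ControlSummary and min_recovery_tli_is_plausible are made-up names, LSNs are flattened to 64-bit integers (0/3023EA8 becomes 0x3023EA8), and it assumes timeline IDs only grow along one history.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the four control-file fields being compared. */
typedef struct
{
	uint64_t	checkpoint_lsn;
	uint32_t	checkpoint_tli;
	uint64_t	min_recovery_lsn;
	uint32_t	min_recovery_tli;
} ControlSummary;

/*
 * A min recovery point that lies beyond the latest checkpoint should not
 * be stamped with an older timeline than that checkpoint.
 */
bool
min_recovery_tli_is_plausible(const ControlSummary *c)
{
	assert(c != NULL);

	if (c->min_recovery_lsn > c->checkpoint_lsn &&
		c->min_recovery_tli < c->checkpoint_tli)
		return false;
	return true;
}
```

Fed the values from the dump (checkpoint at 0x3023EA8 on timeline 2, min recovery point at 0x3023F08 on timeline 1), the function returns false. With the min recovery point stamped as timeline 2, it returns true.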

Mulling over this for some more, I propose the attached patch. With the
patch, we peek into the checkpoint record, and actually perform the
timeline switch (by changing ThisTimeLineID) before replaying it. That
way the checkpoint record is really considered to be on the new timeline
for all purposes. At the moment, the only difference that makes in
practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to
the new TLI, but it feels logically more correct to do it that way.

- Heikki

Attachments:

fix-minrecoverypointtli-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2618c8d..9bd7f03 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -605,6 +605,7 @@ static void SetLatestXTime(TimestampTz xtime);
 static void SetCurrentChunkStartTime(TimestampTz xtime);
 static void CheckRequiredParameterValues(void);
 static void XLogReportParameters(void);
+static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI);
 static void LocalSetXLogInsertAllowed(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
@@ -5910,11 +5911,40 @@ StartupXLOG(void)
 				}
 
 				/*
+				 * Before replaying this record, check if it is a shutdown
+				 * checkpoint record that causes the current timeline to
+				 * change. The checkpoint record is already considered to be
+				 * part of the new timeline, so we update ThisTimeLineID
+				 * before replaying it. That's important so that replayEndTLI,
+				 * which is recorded as the minimum recovery point's TLI if
+				 * recovery stops after this record, is set correctly.
+				 */
+				if (record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN)
+				{
+					CheckPoint	checkPoint;
+					TimeLineID	newTLI;
+
+					memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
+					newTLI = checkPoint.ThisTimeLineID;
+
+					if (newTLI != ThisTimeLineID)
+					{
+						/* Check that it's OK to switch to this TLI */
+						checkTimeLineSwitch(EndRecPtr, newTLI);
+
+						/* Following WAL records should be run with new TLI */
+						ThisTimeLineID = newTLI;
+					}
+				}
+
+				/*
 				 * Update shared replayEndRecPtr before replaying this record,
 				 * so that XLogFlush will update minRecoveryPoint correctly.
 				 */
 				SpinLockAcquire(&xlogctl->info_lck);
 				xlogctl->replayEndRecPtr = EndRecPtr;
+				xlogctl->replayEndTLI = ThisTimeLineID;
 				SpinLockRelease(&xlogctl->info_lck);
 
 				/*
@@ -7859,6 +7889,48 @@ UpdateFullPageWrites(void)
 }
 
 /*
+ * Check that it's OK to switch to new timeline during recovery.
+ *
+ * 'lsn' is the address of the shutdown checkpoint record we're about to
+ * replay. (Currently, timeline can only change at a shutdown checkpoint).
+ */
+static void
+checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI)
+{
+	/*
+	 * The new timeline better be in the list of timelines we expect
+	 * to see, according to the timeline history. It should also not
+	 * decrease.
+	 */
+	if (newTLI < ThisTimeLineID || !tliInHistory(newTLI, expectedTLEs))
+		ereport(PANIC,
+				(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
+						newTLI, ThisTimeLineID)));
+
+	/*
+	 * If we have not yet reached min recovery point, and we're about
+	 * to switch to a timeline greater than the timeline of the min
+	 * recovery point: trouble. After switching to the new timeline,
+	 * we could not possibly visit the min recovery point on the
+	 * correct timeline anymore. This can happen if there is a newer
+	 * timeline in the archive that branched before the timeline the
+	 * min recovery point is on, and you attempt to do PITR to the
+	 * new timeline.
+	 */
+	if (!XLogRecPtrIsInvalid(minRecoveryPoint) &&
+		XLByteLT(lsn, minRecoveryPoint) &&
+		newTLI > minRecoveryPointTLI)
+		ereport(PANIC,
+				(errmsg("unexpected timeline ID %u in checkpoint record, before reaching minimum recovery point %X/%X on timeline %u",
+						newTLI,
+						(uint32) (minRecoveryPoint >> 32),
+						(uint32) minRecoveryPoint,
+						minRecoveryPointTLI)));
+
+	/* Looks good */
+}
+
+/*
  * XLOG resource manager's routines
  *
  * Definitions of info values are in include/catalog/pg_control.h, though
@@ -7971,44 +8043,13 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		}
 
 		/*
-		 * TLI may change in a shutdown checkpoint.
+		 * We should've already switched to the new TLI before replaying this
+		 * record.
 		 */
 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
-		{
-			/*
-			 * The new timeline better be in the list of timelines we expect
-			 * to see, according to the timeline history. It should also not
-			 * decrease.
-			 */
-			if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
-				!tliInHistory(checkPoint.ThisTimeLineID, expectedTLEs))
-				ereport(PANIC,
-						(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
-								checkPoint.ThisTimeLineID, ThisTimeLineID)));
-
-			/*
-			 * If we have not yet reached min recovery point, and we're about
-			 * to switch to a timeline greater than the timeline of the min
-			 * recovery point: trouble. After switching to the new timeline,
-			 * we could not possibly visit the min recovery point on the
-			 * correct timeline anymore. This can happen if there is a newer
-			 * timeline in the archive that branched before the timeline the
-			 * min recovery point is on, and you attempt to do PITR to the
-			 * new timeline.
-			 */
-			if (!XLogRecPtrIsInvalid(minRecoveryPoint) &&
-				XLByteLT(lsn, minRecoveryPoint) &&
-				checkPoint.ThisTimeLineID > minRecoveryPointTLI)
-				ereport(PANIC,
-						(errmsg("unexpected timeline ID %u in checkpoint record, before reaching minimum recovery point %X/%X on timeline %u",
-								checkPoint.ThisTimeLineID,
-								(uint32) (minRecoveryPoint >> 32),
-								(uint32) minRecoveryPoint,
-								minRecoveryPointTLI)));
-
-			/* Following WAL records should be run with new TLI */
-			ThisTimeLineID = checkPoint.ThisTimeLineID;
-		}
+			ereport(PANIC,
+					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
+							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
 		RecoveryRestartPoint(&checkPoint);
 	}
#66Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#65)
Re: Switching timeline over streaming replication

From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Friday, December 07, 2012 9:22 PM
To: Amit Kapila
Cc: 'PostgreSQL-development'; 'Thom Brown'
Subject: Re: [HACKERS] Switching timeline over streaming replication

On 06.12.2012 15:39, Amit Kapila wrote:

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add
some sanity checks to notice incorrect PITR scenarios. Here's a new
version of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful timeline switch in cascade standbys C & D, stop D.
7. Restart D; startup is successful and it connects to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure

Well, it seems wrong for the control file to contain a situation like
this:

pg_control version number: 932
Catalog version number: 201211281
Database system identifier: 5819228770976387006
Database cluster state: shut down in recovery
pg_control last modified: pe 7. joulukuuta 2012 17.39.57
Latest checkpoint location: 0/3023EA8
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3023EA8
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
...
Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
Min recovery ending location: 0/3023F08
Min recovery ending loc's timeline: 1

Note the latest checkpoint location and its TimelineID, and compare them
with the min recovery ending location. The min recovery ending location
is ahead of latest checkpoint's location; the min recovery ending
location actually points to the end of the checkpoint record. But how
come the min recovery ending location's timeline is 1, while the
checkpoint record's timeline is 2?

Now maybe that would happen to work if we remove the sanity check, but it
still seems horribly confusing. I'm afraid that discrepancy will come
back to haunt us later if we leave it like that. So I'd like to fix
that.

Mulling over this for some more, I propose the attached patch. With the
patch, we peek into the checkpoint record, and actually perform the
timeline switch (by changing ThisTimeLineID) before replaying it. That
way the checkpoint record is really considered to be on the new timeline
for all purposes. At the moment, the only difference that makes in
practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to
the new TLI, but it feels logically more correct to do it that way.

This has fixed both the problems reported in below link:
http://archives.postgresql.org/pgsql-hackers/2012-12/msg00267.php

The code is also fine.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#65)
Re: Switching timeline over streaming replication

Heikki,

Tested this on yesterday's snapshot. Worked great.

Test:

4 Ubuntu 10.04 LTS Cloud Servers (GoGrid)
Configuration:
Compiled 9.3(12-12-12)
with: pg_stat_statements, citext, ISN, btree_gist, pl/perl

Setup Test:
Master-Master
Replicated to: master-replica using pg_basebackup -x.
No archiving.
Master-Replica
replicated to Replica-Replica1 and Replica-Replica2
using pg_basebackup -x
All came up on first try, with no issues. Ran customized pgbench (with
waits); lag time to cascading replicas was < 1 second.

Failover Test:
1. started customized pgbench on master-master.
2. shut down master-master (-fast)
3. promoted master-replica to new master
4. restarted custom pgbench, at master-replica

Result:
Replication to replica-replica1,2 working fine, no interruptions in
existing connections to replica-replicas.

Now I wanna test a chain of cascading replicas ... how far can we chain
these?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#68Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#65)
Re: Switching timeline over streaming replication

On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 06.12.2012 15:39, Amit Kapila wrote:

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new
version of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful timeline switch in cascade standbys C & D, stop D.
7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

This has fixed the problem reported.
However, I can't tell whether there would be any problem if we removed the
check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and just updated
replayEndTLI with ThisTimeLineID.

Well, it seems wrong for the control file to contain a situation like this:

pg_control version number: 932
Catalog version number: 201211281
Database system identifier: 5819228770976387006
Database cluster state: shut down in recovery
pg_control last modified: pe 7. joulukuuta 2012 17.39.57
Latest checkpoint location: 0/3023EA8
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3023EA8
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
...
Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
Min recovery ending location: 0/3023F08
Min recovery ending loc's timeline: 1

Note the latest checkpoint location and its TimelineID, and compare them
with the min recovery ending location. The min recovery ending location is
ahead of latest checkpoint's location; the min recovery ending location
actually points to the end of the checkpoint record. But how come the min
recovery ending location's timeline is 1, while the checkpoint record's
timeline is 2?

Now maybe that would happen to work if we remove the sanity check, but it still
seems horribly confusing. I'm afraid that discrepancy will come back to
haunt us later if we leave it like that. So I'd like to fix that.

Mulling over this for some more, I propose the attached patch. With the
patch, we peek into the checkpoint record, and actually perform the timeline
switch (by changing ThisTimeLineID) before replaying it. That way the
checkpoint record is really considered to be on the new timeline for all
purposes. At the moment, the only difference that makes in practice is that
we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it
feels logically more correct to do it that way.

This patch has already been included in HEAD. Right?

I found another "requested timeline does not contain minimum recovery point"
error scenario in HEAD:

1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
2. Shutdown the master 'M' and promote the standby 'S1', and wait for 'S2'
to reconnect to 'S1'.
3. Set up new cascade standby 'S3' connecting to 'S2'.
Then 'S3' fails to start the recovery because of the following error:

FATAL: requested timeline 2 does not contain minimum recovery
point 0/3000000 on timeline 1
LOG: startup process (PID 33104) exited with exit code 1
LOG: aborting startup due to startup process failure

The result of pg_controldata of 'S3' is:

Latest checkpoint location: 0/3000088
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3000088
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
<snip>
Min recovery ending location: 0/3000000
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0

The content of the timeline history file '00000002.history' is:

1 0/3000088 no recovery target specified
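
For readers unfamiliar with the history file format, the line above can be
read as "timeline 2 forked off timeline 1 at 0/3000088". The following is a
minimal sketch in C of parsing such a line and applying a naive containment
test. HistoryEntry, parse_history_line and point_on_parent are hypothetical
names, and the test is an assumption made for illustration; it does not
reproduce the check the server performs at the start of recovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* One parsed history line: the parent timeline and where it was left. */
typedef struct HistoryEntry
{
	TimeLineID	tli;			/* parent timeline ID */
	XLogRecPtr	switchpoint;	/* LSN at which the new timeline forked off */
} HistoryEntry;

/* Parse "<tli> <hi>/<lo> <reason...>"; the free-text reason is ignored. */
bool
parse_history_line(const char *line, HistoryEntry *entry)
{
	unsigned int tli;
	unsigned int hi;
	unsigned int lo;

	assert(line != NULL && entry != NULL);
	if (sscanf(line, "%u %X/%X", &tli, &hi, &lo) != 3)
		return false;
	entry->tli = tli;
	entry->switchpoint = ((XLogRecPtr) hi << 32) | lo;
	return true;
}

/* Naive test: is (tli, lsn) on the parent timeline at or before the fork? */
bool
point_on_parent(const HistoryEntry *entry, TimeLineID tli, XLogRecPtr lsn)
{
	return tli == entry->tli && lsn <= entry->switchpoint;
}
```

Under this naive reading, the minimum recovery point 0/3000000 on timeline 1
precedes the switch point 0/3000088, so it would count as part of timeline
2's history.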

Regards,

--
Fujii Masao


#69Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Josh Berkus (#67)
Re: Switching timeline over streaming replication

On 15.12.2012 01:09, Josh Berkus wrote:

Tested this on yesterday's snapshot. Worked great.

Thanks for the testing!

Now I wanna test a chain of cascading replicas ... how far can we chain
these?

There's no limit in theory. I tested with one master and two chained
standbys myself. Give it a shot, I'm curious to hear how it works with a
chain of a hundred standbys ;-).

- Heikki


#70Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#69)
Re: Switching timeline over streaming replication

On 17 December 2012 12:07, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 15.12.2012 01:09, Josh Berkus wrote:

Tested this on yesterday's snapshot. Worked great.

Thanks for the testing!

Now I wanna test a chain of cascading replicas ... how far can we chain
these?

There's no limit in theory. I tested with one master and two chained
standbys myself. Give it a shot, I'm curious to hear how it works with a
chain of a hundred standbys ;-).

I just set up 120 chained standbys, and for some reason I'm seeing these
errors:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL
segment 000000020000000000000003 has already been removed

ERROR: requested WAL segment 000000020000000000000003 has already been
removed
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
ERROR: requested WAL segment 000000020000000000000003 has already been
removed

The "End of WAL reached on timeline 2" appears on all standbys except the
one streaming directly from the primary.

However, changes continue to cascade to all standbys right to the end of
the chain (although it takes several minutes to propagate).

--
Thom

#71Josh Berkus
josh@agliodbs.com
In reply to: Thom Brown (#70)
Re: Switching timeline over streaming replication

Since Thom already did the destruction test, I only chained 7 standbys,
just to see if I could reproduce his error.

In the process, I accidentally connected one standby to itself. This
failed, but the error message wasn't very helpful; it just gave me
"FATAL: could not connect, the database system is starting up". Surely
there's some way we could tell the user they've tried to connect a
standby to itself?

Anyway, I was unable to reproduce Thom's error. I did not see the
error message he did.

Without any read queries running on the standbys, lag from master to
replica7 averaged about 0.5 seconds, ranging between 0.1 seconds and 1.2
seconds.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#72Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#71)
Re: Switching timeline over streaming replication

Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up
a chain of 5 replicas to see if I could connect them in a loop.
However, I ran into a weird issue when starting up "r3": it refused to
come out of "the database is starting up" mode until I did a write on
the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test)
with it twice and got the exact same results each time.

This is very strange because I did not encounter the same issues with r2
or r4. Nor have I seen this before in my tests.

I'm also seeing Thom's spurious error message now. Each of r2, r3 and
r4 have the following message once in their logs:

LOG: database system was interrupted while in recovery at log time
2012-12-19 02:49:34 GMT
HINT: If this has occurred more than once some data might be corrupted
and you might need to choose an earlier recovery target.

This message doesn't seem to signify anything.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#73Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Josh Berkus (#72)
Re: Switching timeline over streaming replication

On 19.12.2012 04:57, Josh Berkus wrote:

Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up
a chain of 5 replicas to see if I could connect them in a loop.
However, I ran into a weird issue when starting up "r3": it refused to
come out of "the database is starting up" mode until I did a write on
the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test)
with it twice and got the exact same results each time.

This is very strange because I did not encounter the same issues with r2
or r4. Nor have I seen this before in my tests.

Ok.. I'm going to need some more details on how to reproduce this, I'm
not seeing that when I set up four standbys.

I'm also seeing Thom's spurious error message now. Each of r2, r3 and
r4 have the following message once in their logs:

LOG: database system was interrupted while in recovery at log time
2012-12-19 02:49:34 GMT
HINT: If this has occurred more than once some data might be corrupted
and you might need to choose an earlier recovery target.

This message doesn't seem to signify anything.

Yep. You get that message when you start up the system from a base
backup that was taken from a standby server. It's just noise, it would
be nice if we could dial it down somehow.

In general, streaming replication and backups tend to be awfully noisy.
I've been meaning to go through all the messages that get printed during
normal operation and think carefully which ones are really necessary,
which ones could perhaps be merged into more compact messages. But
haven't gotten around to it; that would be a great project for someone
who actually sets up these systems regularly in production.

- Heikki


#74Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#73)
Re: Switching timeline over streaming replication

On 19.12.2012 15:55, Heikki Linnakangas wrote:

On 19.12.2012 04:57, Josh Berkus wrote:

Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up
a chain of 5 replicas to see if I could connect them in a loop.
However, I ran into a weird issue when starting up "r3": it refused to
come out of "the database is starting up" mode until I did a write on
the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test)
with it twice and got the exact same results each time.

This is very strange because I did not encounter the same issues with r2
or r4. Nor have I seen this before in my tests.

Ok.. I'm going to need some more details on how to reproduce this, I'm
not seeing that when I set up four standbys.

Ok, I managed to reproduce this now. The problem seems to be a timing
problem, when a standby switches to follow a new timeline. Four is not a
magic number here, it can happen with just one cascading standby too.

When the timeline switch happens (for example, the standby changes its
recovery target timeline from 1 to 2 at WAL position 0/30002D8), it has
all the WAL up to that WAL position. However, it only has that WAL in
file 000000010000000000000003, corresponding to timeline 1, and not in
the file 000000020000000000000003, corresponding to the new timeline.
When a cascaded standby connects, it requests to start streaming from
point 0/3000000 at timeline 2 (we always start streaming from the
beginning of a segment, to avoid leaving partially-filled segments in
pg_xlog). The walsender in the 1st standby tries to read that from file
000000020000000000000003, which does not exist yet.
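
The mapping from a timeline and WAL position to a segment file name can be
sketched as follows. This is a minimal illustration in C, assuming the
default 16 MB segment size; segment_start and wal_file_name are hypothetical
helper names and are not the server's own functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Assumption: default 16 MB WAL segments. */
#define WAL_SEG_SIZE	UINT64_C(0x1000000)
/* Number of segments in each 4 GB range of WAL positions. */
#define SEGS_PER_XLOGID	(UINT64_C(0x100000000) / WAL_SEG_SIZE)

/* Round a WAL position down to the start of its segment. */
XLogRecPtr
segment_start(XLogRecPtr lsn)
{
	return lsn - (lsn % WAL_SEG_SIZE);
}

/* Write the 24-hex-digit segment file name for (tli, lsn) into buf. */
void
wal_file_name(char *buf, size_t buflen, TimeLineID tli, XLogRecPtr lsn)
{
	uint64_t	segno = lsn / WAL_SEG_SIZE;

	assert(buf != NULL && buflen >= 25);
	snprintf(buf, buflen, "%08X%08X%08X",
			 (unsigned int) tli,
			 (unsigned int) (segno / SEGS_PER_XLOGID),
			 (unsigned int) (segno % SEGS_PER_XLOGID));
}
```

For position 0/30002D8 the segment number is 3, so the same WAL position maps
to 000000010000000000000003 on timeline 1 and 000000020000000000000003 on
timeline 2, and the segment starts at 0/3000000.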

The problem goes away after some time, after the 1st standby has
streamed the contents of 000000020000000000000003 and written it to
disk, and the cascaded standby reconnects. But it would be nice to avoid
that situation. I'm not sure how to do that yet, we might need to track
the timeline we're currently receiving/sending more carefully. Or
perhaps we need to copy the previous WAL segment to the new name when
switching recovery target timeline, like we do when a server is
promoted. I'll try to come up with something...

- Heikki


#75Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#74)
Re: Switching timeline over streaming replication

On 19.12.2012 17:27, Heikki Linnakangas wrote:

On 19.12.2012 15:55, Heikki Linnakangas wrote:

On 19.12.2012 04:57, Josh Berkus wrote:

Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up
a chain of 5 replicas to see if I could connect them in a loop.
However, I ran into a weird issue when starting up "r3": it refused to
come out of "the database is starting up" mode until I did a write on
the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test)
with it twice and got the exact same results each time.

This is very strange because I did not encounter the same issues with r2
or r4. Nor have I seen this before in my tests.

Ok.. I'm going to need some more details on how to reproduce this, I'm
not seeing that when I set up four standbys.

Ok, I managed to reproduce this now.

Hmph, no I didn't, I replied to the wrong email. The problem I managed to
reproduce was the one where you get "requested WAL
segment 000000020000000000000003 has already been removed" errors,
reported by Thom.

- Heikki


#76Joshua Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#74)
Re: Switching timeline over streaming replication

Heikki,

The problem goes away after some time, after the 1st standby has
streamed the contents of 000000020000000000000003 and written it to
disk, and the cascaded standby reconnects. But it would be nice to
avoid that situation. I'm not sure how to do that yet, we might need to
track the timeline we're currently receiving/sending more carefully. Or
perhaps we need to copy the previous WAL segment to the new name when
switching recovery target timeline, like we do when a server is
promoted. I'll try to come up with something...

Would it be accurate to say that this issue only happens when all of the replicated servers have no traffic?

--Josh


#77Joshua Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#75)
Re: Switching timeline over streaming replication

Heikki,

The next time I get the issue, and I'm not paying for 5 cloud servers by the hour, I'll give you a login.

--Josh

----- Original Message -----

On 19.12.2012 17:27, Heikki Linnakangas wrote:

On 19.12.2012 15:55, Heikki Linnakangas wrote:

On 19.12.2012 04:57, Josh Berkus wrote:

Heikki,

I ran into an unexpected issue while testing. I just wanted to fire up
a chain of 5 replicas to see if I could connect them in a loop.
However, I ran into a weird issue when starting up "r3": it refused to
come out of "the database is starting up" mode until I did a write on
the master. Then it came up fine.

master-->r1-->r2-->r3-->r4

I tried doing the full replication sequence (basebackup, startup, test)
with it twice and got the exact same results each time.

This is very strange because I did not encounter the same issues with r2
or r4. Nor have I seen this before in my tests.

Ok.. I'm going to need some more details on how to reproduce this, I'm
not seeing that when I set up four standbys.

Ok, I managed to reproduce this now.

Hmph, no I didn't, I replied to wrong email. The problem I managed to
reproduce was the one where you get "requested WAL
segment 000000020000000000000003 has already been removed" errors,
reported by Thom.

- Heikki


#78Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Thom Brown (#70)
Re: Switching timeline over streaming replication

On 17.12.2012 15:05, Thom Brown wrote:

I just set up 120 chained standbys, and for some reason I'm seeing these
errors:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL
segment 000000020000000000000003 has already been removed

ERROR: requested WAL segment 000000020000000000000003 has already been
removed
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
ERROR: requested WAL segment 000000020000000000000003 has already been
removed

I just committed a patch that should make the "requested WAL segment
000000020000000000000003 has already been removed" errors go away. The
trick was for walsenders to not switch to the new timeline until at
least one record has been replayed on it. That closes the window where
the walsender already considers the new timeline to be the latest, but
the WAL file has not been created yet.

- Heikki


#79Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#78)
Re: Switching timeline over streaming replication

On 2012-12-20 14:45:05 +0200, Heikki Linnakangas wrote:

On 17.12.2012 15:05, Thom Brown wrote:

I just set up 120 chained standbys, and for some reason I'm seeing these
errors:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL
segment 000000020000000000000003 has already been removed

ERROR: requested WAL segment 000000020000000000000003 has already been
removed
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
ERROR: requested WAL segment 000000020000000000000003 has already been
removed

I just committed a patch that should make the "requested WAL segment
000000020000000000000003 has already been removed" errors go away. The trick
was for walsenders to not switch to the new timeline until at least one
record has been replayed on it. That closes the window where the walsender
already considers the new timeline to be the latest, but the WAL file has
not been created yet.

I vote for introducing InvalidTimeLineID soon... 0 as an invalid
TimeLineID seems to spread and is annoying to grep for.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#80Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#68)
1 attachment(s)
Re: Switching timeline over streaming replication

On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 06.12.2012 15:39, Amit Kapila wrote:

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

On 05.12.2012 14:32, Amit Kapila wrote:

On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

After some diversions to fix bugs and refactor existing code, I've
committed a couple of small parts of this patch, which just add some
sanity checks to notice incorrect PITR scenarios. Here's a new
version of the main patch based on current HEAD.

After testing with the new patch, the following problems are observed.

Defect - 1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. start another standby D following C.
5. Promote standby B.
6. After successful timeline switch in cascade standbys C & D, stop D.

7. Restart D, Startup is successful and connecting to standby C.
8. Stop C.
9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited
with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup
process failure

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

This has fixed the problem reported.
However, I cannot tell whether there would be any problem if we removed
the check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and just updated
replayEndTLI with ThisTimeLineID. Would there be?

Well, it seems wrong for the control file to contain a situation like this:

pg_control version number: 932
Catalog version number: 201211281
Database system identifier: 5819228770976387006
Database cluster state: shut down in recovery
pg_control last modified: pe 7. joulukuuta 2012 17.39.57
Latest checkpoint location: 0/3023EA8
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3023EA8
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
...
Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
Min recovery ending location: 0/3023F08
Min recovery ending loc's timeline: 1

Note the latest checkpoint location and its TimelineID, and compare them
with the min recovery ending location. The min recovery ending location is
ahead of latest checkpoint's location; the min recovery ending location
actually points to the end of the checkpoint record. But how come the min
recovery ending location's timeline is 1, while the checkpoint record's
timeline is 2?

Now maybe that would happen to work if we remove the sanity check, but it still
seems horribly confusing. I'm afraid that discrepancy will come back to
haunt us later if we leave it like that. So I'd like to fix that.
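
To make the comparison of those two locations concrete, here is a minimal C sketch that parses the textual "hi/lo" form printed by pg_controldata into a 64-bit position and formats it back. The helper names parse_lsn and format_lsn are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "hi/lo" (two hexadecimal 32-bit halves) into
 * a 64-bit WAL position. Returns 1 on success, 0 on malformed input.
 */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi, lo;

    assert(text != NULL && lsn != NULL);
    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return 0;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}

/* Hypothetical helper: format a 64-bit WAL position back as "hi/lo". */
static void
format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32),
             (unsigned int) (uint32_t) lsn);
}
```

Parsing 0/3023EA8 (latest checkpoint) and 0/3023F08 (min recovery ending location) shows the second is 0x60 bytes past the first. That fits the description that the min recovery ending location points to the end of the checkpoint record, even though the control file records a different timeline for each.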

Mulling over this for some more, I propose the attached patch. With the
patch, we peek into the checkpoint record, and actually perform the timeline
switch (by changing ThisTimeLineID) before replaying it. That way the
checkpoint record is really considered to be on the new timeline for all
purposes. At the moment, the only difference that makes in practice is that
we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it
feels logically more correct to do it that way.

This patch has already been included in HEAD. Right?

I found another "requested timeline does not contain minimum recovery point"
error scenario in HEAD:

1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
2. Shut down the master 'M', promote the standby 'S1', and wait for 'S2'
to reconnect to 'S1'.
3. Set up a new cascade standby 'S3' connecting to 'S2'.

Then 'S3' fails to start recovery because of the following error:

FATAL: requested timeline 2 does not contain minimum recovery
point 0/3000000 on timeline 1
LOG: startup process (PID 33104) exited with exit code 1
LOG: aborting startup due to startup process failure

The result of pg_controldata of 'S3' is:

Latest checkpoint location: 0/3000088
Prior checkpoint location: 0/2000060
Latest checkpoint's REDO location: 0/3000088
Latest checkpoint's REDO WAL file: 000000020000000000000003
Latest checkpoint's TimeLineID: 2
<snip>
Min recovery ending location: 0/3000000
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0

The content of the timeline history file '00000002.history' is:

1 0/3000088 no recovery target specified

I can still reproduce this problem. Attached is the shell script
which reproduces the problem.

Regards,

--
Fujii Masao

Attachments:

fujii_test.sh (application/x-sh)

#81Joshua Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#78)
Re: Switching timeline over streaming replication

I just committed a patch that should make the "requested WAL segment
000000020000000000000003 has already been removed" errors go away. The
trick was for walsenders to not switch to the new timeline until at
least one record has been replayed on it. That closes the window where
the walsender already considers the new timeline to be the latest, but
the WAL file has not been created yet.

OK, I'll download the snapshot in a couple of days and make sure this didn't break anything else.

--Josh


#82Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#78)
Re: Switching timeline over streaming replication

On 20 December 2012 12:45, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 17.12.2012 15:05, Thom Brown wrote:

I just set up 120 chained standbys, and for some reason I'm seeing these
errors:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: ERROR: requested WAL
segment 000000020000000000000003 has already been removed

ERROR: requested WAL segment 000000020000000000000003 has already been
removed
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
ERROR: requested WAL segment 000000020000000000000003 has already been
removed

I just committed a patch that should make the "requested WAL segment
000000020000000000000003 has already been removed" errors go away. The
trick was for walsenders to not switch to the new timeline until at least
one record has been replayed on it. That closes the window where the
walsender already considers the new timeline to be the latest, but the WAL
file has not been created yet.

Now I'm getting this on all standbys after promoting the first standby in a
chain.

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/301EC10
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
FATAL: could not receive data from WAL stream:
LOG: new target timeline is 2
FATAL: could not connect to the primary server: FATAL: the database
system is in recovery mode

LOG: started streaming WAL from primary at 0/3000000 on timeline 2
TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File:
"walsender.c", Line: 1425)
LOG: server process (PID 19917) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted while in recovery at log time
2012-12-20 23:41:23 GMT
HINT: If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOG: entering standby mode
FATAL: the database system is in recovery mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000E8
LOG: database system is ready to accept read only connections
LOG: record with zero length at 0/301EC70
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: unexpected EOF on standby connection

And if I restart the new primary, the first new standby connected to it
shows:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
FATAL: error reading result of streaming command: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

LOG: record with zero length at 0/301F1E0

However, all other standbys don't show any additional log output.

--
Thom

#83Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Thom Brown (#82)
Re: Switching timeline over streaming replication

On 21.12.2012 01:50, Thom Brown wrote:

Now I'm getting this on all standbys after promoting the first standby in a
chain.
...
TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File:
"walsender.c", Line: 1425)

Sigh. I'm sounding like a broken record, but I just committed another
fix for this, should work now.

- Heikki


#84Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#83)
Re: Switching timeline over streaming replication

On 21 December 2012 18:13, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 21.12.2012 01:50, Thom Brown wrote:

Now I'm getting this on all standbys after promoting the first standby in a
chain.

...

TRAP: FailedAssertion("!(((sentPtr) <= (SendRqstPtr)))", File:
"walsender.c", Line: 1425)

Sigh. I'm sounding like a broken record, but I just committed another fix
for this, should work now.

Thanks Heikki. Just quickly retested with a new set of 120 standbys and
all looks fine as far as the logs are concerned:

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: record with zero length at 0/37902A0
LOG: fetching timeline history file for timeline 2 from primary server
LOG: restarted WAL streaming at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
LOG: record with zero length at 0/643E248
LOG: fetching timeline history file for timeline 3 from primary server
LOG: restarted WAL streaming at 0/6000000 on timeline 2
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 2
LOG: new target timeline is 3
LOG: restarted WAL streaming at 0/6000000 on timeline 3
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 3
LOG: record with zero length at 0/6BB13A8
LOG: fetching timeline history file for timeline 4 from primary server
LOG: restarted WAL streaming at 0/6000000 on timeline 3
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 3
LOG: new target timeline is 4
LOG: restarted WAL streaming at 0/6000000 on timeline 4

--
Thom

#85Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#80)
Re: Switching timeline over streaming replication

On Fri, Dec 21, 2012 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I can still reproduce this problem. Attached is the shell script
which reproduces the problem.

This problem happens when a new standby starts up from a backup
taken from another standby and its recovery starts from the shutdown
checkpoint record that causes the timeline switch. In this case,
the timeline of the minimum recovery point can be different from that of
the latest checkpoint (i.e., the shutdown checkpoint). But the following
check in StartupXLOG() wrongly assumes that they are always the same,
so the problem happens.

/*
 * The min recovery point should be part of the requested timeline's
 * history, too.
 */
if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) &&
    tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
        ControlFile->minRecoveryPointTLI)
    ereport(FATAL,
            (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
                    recoveryTargetTLI,
                    (uint32) (ControlFile->minRecoveryPoint >> 32),
                    (uint32) ControlFile->minRecoveryPoint,
                    ControlFile->minRecoveryPointTLI)));

Without such a check, the minimum recovery point is initialized to the
latest checkpoint location in a later phase of recovery, as follows.
This suggests to me that the timeline of the minimum recovery point
should be checked after it's initialized. So ISTM that the right fix
for the problem is to move the above check after the following
initialization. Thoughts?

/* initialize minRecoveryPoint if not set yet */
if (XLByteLT(ControlFile->minRecoveryPoint, checkPoint.redo))
{
    ControlFile->minRecoveryPoint = checkPoint.redo;
    ControlFile->minRecoveryPointTLI = checkPoint.ThisTimeLineID;
}

Regards,

--
Fujii Masao


#86Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#85)
Re: Switching timeline over streaming replication

On 23.12.2012 16:37, Fujii Masao wrote:

I can still reproduce this problem. Attached is the shell script
which reproduces the problem.

This problem happens when a new standby starts up from a backup
taken from another standby and its recovery starts from the shutdown
checkpoint record that causes the timeline switch. In this case,
the timeline of the minimum recovery point can be different from that of
the latest checkpoint (i.e., the shutdown checkpoint). But the following
check in StartupXLOG() wrongly assumes that they are always the same,
so the problem happens.

/*
 * The min recovery point should be part of the requested timeline's
 * history, too.
 */
if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) &&
    tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
        ControlFile->minRecoveryPointTLI)
    ereport(FATAL,
            (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
                    recoveryTargetTLI,
                    (uint32) (ControlFile->minRecoveryPoint >> 32),
                    (uint32) ControlFile->minRecoveryPoint,
                    ControlFile->minRecoveryPointTLI)));

No, it doesn't assume that min recovery point is on the same timeline as
the checkpoint record. This is another variant of the "timeline history
files are not included in the backup" problem discussed on the other
thread with subject "pg_basebackup from cascading standby after timeline
switch". If you remove the min recovery point check above, the test case
still fails, with a different error message:

LOG: unexpected timeline ID 1 in log segment 000000020000000000000003,
offset 0

If you modify the test script to copy the 00000002.history file to
data-standby3/pg_xlog after running pg_basebackup, the test case works.
(We still need to fix it, of course.)
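
A rough C sketch of why the missing history file matters for that check follows. It is a simplified stand-in for the lookup, not the actual tliOfPointInHistory() code: the TimelineEntry struct and both function names are hypothetical, 0 is used to mean "unbounded", and it assumes that a missing history file leaves the standby with a single-entry history containing only the target timeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified timeline history entry. */
typedef struct
{
    uint32_t tli;   /* timeline ID */
    uint64_t begin; /* first position on this timeline; 0 = unbounded */
    uint64_t end;   /* first position past this timeline; 0 = still open */
} TimelineEntry;

/* Return the timeline that contains ptr, or 0 if no entry covers it. */
static uint32_t
tli_of_point(uint64_t ptr, const TimelineEntry *hist, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if ((hist[i].begin == 0 || hist[i].begin <= ptr) &&
            (hist[i].end == 0 || ptr < hist[i].end))
            return hist[i].tli;
    }
    return 0;
}

/*
 * Mirror of the startup sanity check: the byte just before the minimum
 * recovery point must lie on the timeline recorded for it.
 */
static int
min_recovery_point_ok(uint64_t min_rec, uint32_t min_rec_tli,
                      const TimelineEntry *hist, size_t n)
{
    assert(min_rec > 0);
    return tli_of_point(min_rec - 1, hist, n) == min_rec_tli;
}
```

With the history from 00000002.history (timeline 1 ends at 0/3000088, timeline 2 begins there), the point 0/3000000 on timeline 1 passes the check. With only a single open-ended entry for timeline 2, which is what the sketch assumes happens when the history file is absent, the same point is attributed to timeline 2 and the check fails, matching the FATAL message in the test case.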

- Heikki
